Abstract



Reaction times in perceptual tasks are the subject of many experimental and
theoretical studies. With the neural decision making process as main focus, most of
these works concern discrete (typically binary) choice tasks, implying the
identification of the stimulus as an exemplar of a category. Here we address issues
specific to the perception of categories (e.g. vowels, familiar faces, ...), making a
clear distinction between identifying a category (an element of a discrete set) and
estimating a continuous parameter (such as a direction). We exhibit a link between
optimal Bayesian decoding and coding efficiency, the latter being measured by the
mutual information between the discrete category set and the neural activity. We
characterize the properties of the best estimator of the likelihood of the category,
when this estimator takes its inputs from a large population of stimulus-specific
coding cells. Adopting the diffusion-to-bound approach to model the decisional
process, this allows us to relate analytically the bias and variance of the diffusion
process underlying decision making to macroscopic quantities that are behaviorally
measurable. A major consequence is the existence of a quantitative link between
reaction times and discrimination accuracy. The resulting analytical expression of
mean reaction times during an identification task accounts for empirical facts, both
qualitatively (e.g. more time is needed to identify a category from a stimulus at the
boundary compared to a stimulus lying within a category), and quantitatively (working
on published experimental data on phoneme identification tasks).



1 Introduction 

This paper addresses issues specific to the perception of categories (e.g. vowels,
familiar faces, colors, ...), making a clear distinction between identifying a
category (an element of a discrete set) and estimating a continuous parameter (such
as a direction). Categorization has long been known to influence perceptual
judgments, as illustrated by many experiments based on discrimination and/or
categorization tasks. In particular, a perceptual phenomenon called categorical
perception states that discrimination accuracy is higher at the boundary between
categories than within a category (see Harnad, 1987, for a review). This phenomenon
has been much studied by psycholinguists in the case of phonemic categories: since
languages differ in the number and the distribution of their phonemic categories,
language acquisition indeed entails specific perceptual abilities (Abramson and
Lisker, 1970; Werker and Tees, 1984; Kuhl et al., 1992; Polka and Werker, 1994). In
addition to discriminability and categorization performances, many studies have
measured reaction times. In the case of phoneme identification tasks (in all that
follows, identification will denote identification of a category), it has been noted
by Pisoni and Tash (1974) that "reaction time is a positive function of uncertainty,
increasing at the phonetic boundary where identification is least consistent and
decreasing where identification is most consistent". These authors thereby noted that
"identification time is slowest for the stimulus region where discrimination is
best." Although this remark was formulated several decades ago, little attention has
been given to understanding the link between these two phenomena, discrimination and
identification time. Such understanding requires first taking into account the
existence of a stimulus-dependent perceptual noise, second determining how this
affects decision making, and last studying how learning or adaptation jointly shapes
perceptual noise and reaction times. However, in previous models of categorization,
discriminability is usually considered as a scale parameter, constant along a given
relevant stimulus or psychological dimension, as in exemplar models (Nosofsky, 1986;
Kruschke, 1992). In Ashby and Maddox (1993), the possibility of having a
stimulus-dependent discriminability is considered for the first time, and later taken
into account in the computation of reaction times (Ashby and Maddox, 1994; Ashby,
2000). Yet, to our knowledge, the information processing nature of both the
discriminability and its link with categorization has never been explored. One of the
main outcomes of the present work is precisely to derive, from the hypothesis of
optimal decoding, an analytical stimulus-dependent relationship between mean reaction
times and discrimination accuracy.



The above-mentioned psycholinguistic studies are particularly interesting because
they exemplify how category learning affects both stages of neural processing: the
encoding stage (the building of a neural representation of the stimuli) and the
decoding stage (the reading-out of the categorical information and the
decision-making process). We assume that the former stage characterizes performances
in a discrimination task, and that the latter is revealed by the measure of reaction
times in an identification task. A perceptual system that encodes categories aims at
minimizing the probability of misclassifying incoming stimuli. It has to face two
sources of uncertainty. The first one is independent of the neural code and lies in
the intrinsic confusion between classes. For instance, vowels typically overlap in
stimulus space. The second source of uncertainty comes from the noisy response of the
neurons. In a previous work (Bonnasse-Gahot and Nadal, 2008), we showed how these two
types of noise interact at the coding level. More precisely, adopting a population
coding scheme and making use of information theoretic tools, we quantified the coding
efficiency of a neural representation with respect to a set of categories by means of
the mutual information (a measure of statistical dependency) between the set of
categories and the neural activity. We showed that this information is essentially
proportional to the ratio between two Fisher information values: in the numerator
a term that is independent of the neural code and that quantifies the overlap
between categories in stimulus space; in the denominator a term that only depends on
the neural code and that quantifies the sensitivity of the population code to small
variations in the input space. An optimized code (resulting from either learning or
evolutionary adaptation) is then realized by allocating more neuronal resources at
the boundary between classes in order to have a greater Fisher information value of
the neuronal population in this region, which implies a better sensitivity. In other
words, if the code is optimized, discrimination is greater between categories than
within a category, hence categorical perception. In this previous work, optimality
is defined as maximization of the mutual information between the categories and
the coding layer. No issue specific to the decoding stage was addressed there, even
though maximizing the mutual information amounts to minimizing the probability
that an ideal observer misclassifies an incoming stimulus (this property is formally
given by Fano's inequality; see Cover and Thomas, 2006, §2.10, for a general
statement, and Bonnasse-Gahot and Nadal, 2008, §2.2, for a formulation of this
inequality within the present context).



In the present work, we focus on the decoding stage, in particular on how it
depends on the stimuli characteristics and on the efficiency of the coding stage.
We show how the two types of noise (the two types of Fisher information values
mentioned above) play a crucial role in the optimal decoding properties, hence in
particular in shaping the reaction times. To derive our results, we work within a
probabilistic framework, and consider a neural model based on a population encoding
scheme as in Bonnasse-Gahot and Nadal (2008), and closely related to other neural
models (..., 2007; Beck et al., 2008). In particular, the model can be seen as an
extended version of the covering model proposed by Kruschke (1992), where the
covering map is here interpreted as a neuronal population (with noisy activities),
each neuron being specific to some region in the input space, and with the addition
of a decision process based on a random walk dynamics. It can also be seen as a
simplified version of the SPEED model of Ashby et al. (2007) - leaving aside the
learning issues not addressed here - in a way which allows for analytical results.
The chosen model is precisely a compromise between biological plausibility and
mathematical simplicity, allowing for analytical treatments. After a detailed
presentation of the model in Section 2, we proceed in Section 3 with the two
following main points.

• First point: optimal read-out. We study a decoding layer that provides an estima- 
tion of posterior probabilities. We derive the theoretical properties of the optimal 
Bayesian decoder of the categorical information embedded in the coding layer. A 
crucial result is the derivation of a relationship between optimal decoding from a 
Bayesian point of view, and encoding efficiency as quantified by the mutual informa- 
tion between the neural activity and the categories. In particular, this relationship 
shows that maximizing information makes it possible to have a better estimate (in 
the sense that its variance is reduced) of the posterior probabilities giving the like- 
lihood of a class knowing a stimulus in the transition regions between categories, 
which are the main sources of classification errors. Then, and quite importantly, we 
show that the neural parameters (tuning curves in the coding layer and synaptic 
weights for the decoding layer) of the considered architecture can be adapted to 
provide the optimal estimator as output of the network. 

• Second point: decision process. We consider the decision making mechanism as a
diffusion model applied to the output of our network. First introduced in psychology
(Link and Heath, 1975; Ratcliff, 1978; Ratcliff et al., 1999), diffusion models
have been proposed as general models of decision-making, notably to account for
reaction times. Roughly speaking, in the case of a two-alternative choice, diffusion
models assume that a decision variable, which carries evidence accumulated in favor
of one or the other choice, evolves stochastically over time until it reaches some
threshold, leading to the decision. This type of model has more recently gained
consideration in the field of neuroscience (Smith and Ratcliff, 2004; Gold and
Shadlen, 2007). The general theory of first passage times allows one to analytically
express the mean reaction times during a category identification task. Although the
mathematical foundations of this theory are general, most neuroscience applications
assume that the variance of the decision variable is independent of the presented
stimulus (see e.g. Huk and Shadlen, 2005). Some models take into account the
possibility of a stimulus-dependent variance, thus exhibiting a relationship between
perceptual noise and reaction times (see e.g. Ashby and Maddox, 1994; Ashby, 2000).
However, none of these works considers how both bias and variance in the diffusion
model depend on the stimulus when assuming optimal decoding. In the present paper,
the dependency on the stimulus - in both its categorical specificity and its encoding
quality - is crucial. For the considered architecture, building on the first point
(optimal read-out) which studies the interplay between coding efficiency and optimal
decoding, we exhibit a quantitative link between reaction times and discrimination
as a function of the stimulus. With the aim of comparing the predictions of our
model with behavioral data, we make a link between microscopic quantities (tuning
curves of the neurons, synaptic weights) and macroscopic quantities (discrimination
accuracy). The resulting formula makes it possible to model quantitatively mean
reaction times obtained in a psycholinguistic experiment by Ylinen et al. (2005):
our analysis allows one to better analyze the difference in behavior between two
groups, one for which one may expect that encoding has been efficiently adapted to
the considered stimuli, and one for which this is not the case.



Finally, in Section 4 we put the emphasis on the analysis of the interplay between
identification and discrimination as revealed by psycholinguistic studies on phonemic 
perception, and on the confrontation with neurophysiological data, and discuss the 
possible extensions of the model. 



2 Model 

2.1 Identification of categories: probabilistic framework 

We consider $M$ categories, subscripted by $\mu = 1, \ldots, M$, and characterized by
a probability of occurrence $q_\mu$, so that $\sum_{\mu} q_\mu = 1$, and a density
distribution $P(x|\mu)$, where $x$ denotes the stimulus. For instance, $x$ might
represent the voice onset time (VOT) dimension in the case of stop consonants
($x \in \mathbb{R}$), or the two or three first formants in the case of vowels
($x \in \mathbb{R}^2$ or $\mathbb{R}^3$). However, for simplicity, in all that
follows we will assume that the stimulus space is unidimensional, that is
$x \in \mathbb{R}$ (although our general theory is easily generalized to the
multidimensional case). A stimulus $x$ elicits a response
$\mathbf{r} = \{r_1, \ldots, r_N\}$ from a population of $N$ neurons that aims at
encoding categorical information in a distributed fashion. The neural activity
$\mathbf{r}$ depends on the class $\mu$ only through the sensory input $x$:

$$ P(\mathbf{r}|\mu) = \int dx\, P(\mathbf{r}|x)\, P(x|\mu). \tag{2.1} $$

We restrict our analysis to the following conditions:

1. for any neural activity $\mathbf{r}$ there is a uniquely defined stimulus value
$\hat{x}$ which maximizes the likelihood of $\mathbf{r}$ given $x$;

2. the system operates in a regime of high signal-to-noise ratio: $\hat{x}$ is a good
approximation of $x$ (e.g. $N$ is large and $\hat{x}$ converges to $x$ as $N$ goes to
infinity).

The large $N$ limit, which is the appropriate regime for modeling a population code,
makes it possible to have a high signal-to-noise ratio even with noisy individual
neurons. From the mathematical point of view, it makes it possible to obtain
analytical results, with the interesting properties typically given by terms of order
$1/N$ - the first non trivial terms in the large $N$ limit. Similar results would be
obtained for a small number of cells, with low noise or in the large time limit,
provided the mean firing rates are functions of $x$ that allow a good estimate of the
stimulus value.
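As an illustration of condition 2 (not part of the original analysis), the following minimal Python sketch decodes a stimulus by maximum likelihood from Poisson spike counts and compares the root-mean-square error for a small and a large population. The Gaussian tuning curves, all numerical values, and the helper names `tuning_curves` and `rms_error` are assumptions made for this example only.

```python
import numpy as np

def tuning_curves(x, centers, width=0.5, peak=20.0, baseline=0.5):
    """Hypothetical Gaussian tuning curves f_i(x), in spikes per unit time."""
    return baseline + peak * np.exp(-0.5 * ((x - centers) / width) ** 2)

def rms_error(n_neurons, x_true=0.3, tau=0.2, trials=200, seed=0):
    """RMS error of a grid-search maximum-likelihood estimate of x."""
    rng = np.random.default_rng(seed)
    centers = np.linspace(-3.0, 3.0, n_neurons)      # preferred stimuli
    grid = np.linspace(-2.0, 2.0, 801)               # candidate stimulus values
    rates = tau * tuning_curves(grid[:, None], centers[None, :])  # shape (G, N)
    log_rates = np.log(rates)
    total_rate = rates.sum(axis=1)
    mean_counts = tau * tuning_curves(x_true, centers)
    errors = np.empty(trials)
    for k in range(trials):
        r = rng.poisson(mean_counts)                 # one noisy population response
        # Poisson log-likelihood of r given x, up to a term independent of x
        loglik = log_rates @ r - total_rate
        errors[k] = grid[np.argmax(loglik)] - x_true
    return float(np.sqrt(np.mean(errors ** 2)))

err_small = rms_error(25)
err_large = rms_error(400)
print(err_small, err_large)
```

Under these assumptions the error shrinks as the population grows, which is the sense in which the maximum-likelihood value approaches the true stimulus in the large N regime.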



As far as the read-out is concerned, we will assume that, given a neural activity
$\mathbf{r}$ in the coding layer, the goal is to construct as neural output an
estimator $g(\mu|\mathbf{r})$ of the posterior probability $P(\mu|x)$, where $x$
indicates the (true) stimulus that elicited the neural activity $\mathbf{r}$. For a
given stimulus $x$ and a neural activity $\mathbf{r}$, the relevant quality criterion
is given by the divergence (or improperly, the distance) $C(x, \mathbf{r})$ between
the true probabilities $\{P(\mu|x),\ \mu = 1, \ldots, M\}$ and the estimator
$\{g(\mu|\mathbf{r}),\ \mu = 1, \ldots, M\}$, defined as the Kullback-Leibler
divergence (or relative entropy) (Cover and Thomas, 2006):

$$ C(x, \mathbf{r}) = \sum_{\mu=1}^{M} P(\mu|x) \ln \frac{P(\mu|x)}{g(\mu|\mathbf{r})}. \tag{2.2} $$

Averaging over $\mathbf{r}$ given $x$, and then over $x$, the mean cost induced by
the estimation can be written:

$$ \bar{C} = -\mathcal{H}(\mu|x) - \int dx\, p(x) \int d^N r\, P(\mathbf{r}|x) \sum_{\mu=1}^{M} P(\mu|x) \ln g(\mu|\mathbf{r}), \tag{2.3} $$

where $\mathcal{H}(\mu|x) = -\int dx\, p(x) \sum_{\mu=1}^{M} P(\mu|x) \ln P(\mu|x)$ is
the conditional entropy of $\mu$ given $x$. In Section 3 we will study the properties
of the optimal estimator - optimal in the sense that it minimizes the above cost
function (2.3) - and discuss its neural implementation.

Note that our hypothesis on the optimality criterion of the read-out is to be
contrasted with other approaches, such as in Beck et al. (2008) modeling random
dot discrimination task experiments. There, the discreteness of the classes is not
taken into account from the point of view of optimal information processing: the
network makes its decision from the optimal estimation of a continuous variable (the
global direction of the stimulus).
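The single-trial cost of Eq. (2.2) is a plain Kullback-Leibler divergence between two discrete distributions over the M categories. A minimal Python sketch follows; the function name `kl_cost` and the example probabilities are hypothetical.

```python
import numpy as np

def kl_cost(p_true, g_est):
    """Cost of Eq. (2.2): sum over mu of P(mu|x) * ln[P(mu|x) / g(mu|r)]."""
    p = np.asarray(p_true, dtype=float)
    g = np.asarray(g_est, dtype=float)
    mask = p > 0          # categories with P(mu|x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / g[mask])))

# A perfect estimate costs nothing; an uninformative one is penalized.
cost_exact = kl_cost([0.9, 0.1], [0.9, 0.1])
cost_flat = kl_cost([0.9, 0.1], [0.5, 0.5])
print(cost_exact, cost_flat)
```

The cost vanishes only when the estimator equals the true posterior, which is why minimizing its average singles out a probability-matching read-out rather than a mere classifier.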



2.2 Neural modeling 

We now consider a plausible neural architecture. We assume a standard population 
coding scheme for the coding layer, followed by a decoding layer. This feedforward 
information processing is illustrated in Figure 1.



[Figure 1: schematic of the feedforward architecture, with stages labeled Category, Stimulus, Decoding and Decision.]















Figure 1: Model architecture. Given a category $\mu$, a stimulus $x$ is produced according to
some pdf $P(x|\mu)$. The stimulus is encoded by a large population of neurons with
stimulus-specific tuning curves. If the code has been optimized, more resources are allocated to the
boundaries between categories in stimulus space, with tuning curves having steep slopes 
in the transition regions. The information conveyed by the activity of this coding layer is 
extracted by the decoding layer. Thanks to an adaptation of the synaptic weights between 
the encoding and the decoding layer, the activity of the output cells (one per category) 
directly reflects category membership, by estimating the Bayesian posterior probabilities 
of the categories given a stimulus. The activities of these decoding units are the basis of 
the decision-making mechanism. In the case of two categories, the difference in activity 
between the two output cells acts as the decision variable through a diffusion process. 
2.2.1 Population code

For the coding layer - which we assume to characterize the perceptual level -
we consider an assembly of a large number $N$ of cells, with activities denoted
$\mathbf{r} = \{r_1, \ldots, r_N\}$. Each cell $i$ is stimulus-selective, with a mean
response characterized by a tuning curve $f_i(x)$ that peaks at its preferred
stimulus $x_i$, and decreases according to some parameter (the width of the tuning
curve). For simplicity we will assume that the $r_i$'s are independent random
variables given a stimulus $x$ (we will come back later to the important issue of
correlations):

$$ P(\mathbf{r}|x) = \prod_{i=1}^{N} P_i(r_i|x), \tag{2.4} $$

and we will assume Poisson statistics. Hence the mean number of spikes emitted
during a time window $[0, \tau]$ and its variance are both equal to $\tau f_i(x)$:

$$ \langle r_i \rangle_x = \left\langle \left( r_i - \langle r_i \rangle_x \right)^2 \right\rangle_x = \tau f_i(x), \tag{2.5} $$

where $\langle \cdot \rangle_x$ indicates the integration over $\mathbf{r}$ given $x$.

2.2.2 Decoding layer: reading-out

The decoding layer aims at extracting categorical information from the neural
population activity. This decoding layer features $M$ cells, each one connected to
the $N$ neurons of the coding layer. The activity of decoding cell $\mu$ is given by
a function $g(\mu|\mathbf{r}, \mathbf{w})$, which will be interpreted as an estimator
of the class likelihood, where the adaptable parameters are the synaptic weights
$\mathbf{w} = \{w_{\mu i},\ i = 1, \ldots, N,\ \mu = 1, \ldots, M\}$, $w_{\mu i}$
being the synaptic weight from the coding cell $i$ to the decoding cell $\mu$. In
order to constrain the neuronal activities so that the
$g(\mu|\mathbf{r}, \mathbf{w})$ can be interpreted as probabilities, we make the ad
hoc (but standard) choice of a normalization with a softmax nonlinearity, which
constitutes a continuous generalization of the 'winner-take-all' operation. For a
given $\mu$, $g(\mu|\mathbf{r}, \mathbf{w})$ is then defined as:

$$ g(\mu|\mathbf{r}, \mathbf{w}) = \frac{\exp(z_\mu)}{\sum_{\nu=1}^{M} \exp(z_\nu)}, \tag{2.6} $$

where, for every $\nu \in \{1, \ldots, M\}$,

$$ z_\nu = \sum_{i=1}^{N} w_{\nu i}\, r_i. \tag{2.7} $$
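The two layers described above can be sketched in a few lines of Python: Poisson spike counts drawn from a population of tuning curves, then a softmax read-out of weighted sums. This is only a toy illustration. The Gaussian tuning curves, the random untrained weights and every numerical value are assumptions, and the helper names are hypothetical.

```python
import numpy as np

def tuning_curves(x, centers, width=0.4, peak=30.0, baseline=1.0):
    """Mean firing rates f_i(x); the Gaussian shape is an assumption."""
    return baseline + peak * np.exp(-0.5 * ((x - centers) / width) ** 2)

def softmax_readout(r, w):
    """g(mu|r, w): softmax over z_mu = sum_i w[mu, i] * r[i]."""
    z = w @ r
    z = z - z.max()            # shift for numerical stability; g is unchanged
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
N, M, tau = 50, 2, 0.5                                  # toy sizes (hypothetical)
centers = np.linspace(-2.0, 2.0, N)                     # preferred stimuli x_i
w = rng.normal(scale=1.0 / (N * tau), size=(M, N))      # arbitrary, untrained weights
r = rng.poisson(tau * tuning_curves(0.2, centers))      # Poisson spike counts
g = softmax_readout(r, w)
log_odds = np.log(g[1] / g[0])                          # equals z_2 - z_1
print(g, log_odds)
```

With two categories the log-ratio of the two outputs reduces to a single weighted sum of spike counts, which is the quantity used as decision variable in Section 2.2.3.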
We now show that the estimator has a Gaussian distribution. Since the number
$N$ of neurons is large, and the activities of the neurons being independent given an
input $x$, according to the (Lyapunov's generalization of the) central limit theorem
$z_\mu$ is characterized by a Gaussian distribution with mean $\bar{z}_\mu$:

$$ \bar{z}_\mu = \tau \sum_{i} w_{\mu i}\, f_i(x) \tag{2.8} $$

and variance $v(z_\mu)$:

$$ v(z_\mu) = \tau \sum_{i} w_{\mu i}^2\, f_i(x). \tag{2.9} $$

Recall that $\tau f_i(x)$ represents the mean number of spikes emitted during a time
window $[0, \tau]$. For large $N$ and (possibly) large observation time $\tau$, in
order to have $\bar{z}_\mu$ of order 1, the weights must be of order $1/N\tau$, and
then $v(z_\mu)$ is of order $1/N\tau$. Expanding Eq. (2.6) at first order in
$1/N\tau$ shows that $g(\mu|\mathbf{r}, \mathbf{w})$ also follows a Gaussian
distribution, with a mean $\bar{g}_\mu$, the average of
$g(\mu|\mathbf{r}, \mathbf{w})$ over $\mathbf{r}$ at a given value of $x$, of order 1,
and a variance $v(g_\mu)$, of order $1/N\tau$. We will consider the expressions of
this mean and variance in Section 3.
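The expressions (2.8) and (2.9) for the mean and variance of a weighted sum of independent Poisson counts can be checked by simulation. The sketch below is a numerical sanity check under assumed tuning curves and arbitrary weights of order 1/(N tau); none of the numerical values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N, tau, x = 200, 0.5, 0.1                                  # hypothetical values
centers = np.linspace(-2.0, 2.0, N)
f_x = 1.0 + 30.0 * np.exp(-0.5 * ((x - centers) / 0.4) ** 2)   # assumed f_i(x)
w_mu = rng.uniform(-1.0, 1.0, size=N) / (N * tau)          # weights of order 1/(N tau)

z_mean_theory = tau * np.sum(w_mu * f_x)                   # Eq. (2.8)
z_var_theory = tau * np.sum(w_mu ** 2 * f_x)               # Eq. (2.9)

n_trials = 10000
counts = rng.poisson(tau * f_x, size=(n_trials, N))        # independent Poisson counts
z = counts @ w_mu                                          # one value of z_mu per trial

print(z.mean(), z_mean_theory)
print(z.var(), z_var_theory)
```

The empirical variance is of order 1/(N tau), far smaller than the spread of individual spike counts, which reflects the averaging performed by the population.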



2.2.3 Decision making from a diffusion process 



The model is now completed by introducing the decision-making mechanism, which
we present within the general framework of diffusion models, which assume that
information gets accumulated over time in favor of one or the other category until
a threshold is reached, leading to the decision. This kind of model was first
introduced in psychology and provides good quantitative fits of psychophysical data
(Ratcliff, 1978; Ratcliff et al., 1999). Recently, it has found strong
neurobiological support (Smith and Ratcliff, 2004; Gold and Shadlen, 2007).



The analysis presented so far is valid for any number $M$ of categories. However,
random walk or diffusion models only apply to two-alternative cases. In this part,
and whenever appropriate, we thus restrict ourselves to the study of a two-category
case. Generalizing the results to more than two categories would require considering
other types of decision-making models such as accumulator models (see e.g. Vickers,
1970; Usher and McClelland, 2001; Bogacz and Gurney, 2007).



As just said, a diffusion model assumes that information gets accumulated over
time in favor of one or the other category until it reaches a given threshold, leading
to the decision (Link and Heath, 1975; Ratcliff, 1978; Ratcliff et al., 1999). This
information is conveyed by a decision variable that favors one or the other category.
When this variable, initially zero (whenever there is no preexisting bias), reaches
the positive bound (notated $+\gamma$), the category corresponding to this bound (say
category 2) is chosen. Conversely, when this variable reaches the negative bound
(located at $-\gamma$), the other category (category 1 here) is chosen. As a
consequence of the noise characterizing the temporal evolution of the decision
variable, for the very same stimulus different trials might lead to different
choices, and to different reaction times. For the neural architecture studied here,
the decision variable that we consider is the difference between the output
activities $z_{2,\tau}(\mathbf{r})$ and $z_{1,\tau}(\mathbf{r})$, that is the
difference between the logarithms of the probabilities,
$\ln\left[ g(2|\mathbf{r}, \mathbf{w}) / g(1|\mathbf{r}, \mathbf{w}) \right]$.
For a given time window $[0, \tau]$, this difference, notated $a_\tau(\mathbf{r})$,
is thus

$$ a_\tau(\mathbf{r}) = z_{2,\tau}(\mathbf{r}) - z_{1,\tau}(\mathbf{r}) = \sum_{i=1}^{N} (w_{2i} - w_{1i})\, r_i, \tag{2.10} $$

where $r_i$ is the number of spikes emitted by neuron $i$ during the time window
$[0, \tau]$. If $a_\tau(\mathbf{r})$ reaches the upper bound, $+\gamma$
(respectively the lower bound, $-\gamma$), the chosen decision is category 2 (resp.
category 1).

As seen before, the number $N$ of neurons being large, and the activities of the
neurons being independent given an input $x$, one can make use of the central limit
theorem and state that $a_\tau(\mathbf{r})$ is characterized by a Gaussian
distribution, and from (2.10) and (2.5), one can write its mean $\bar{a}(x)$ and
variance $v_a(x)$ as:

$$ \bar{a}(x) = \tau \sum_{i} (w_{2i} - w_{1i})\, f_i(x) = \tau\, a^{\circ}(x), \tag{2.11} $$

$$ v_a(x) = \tau \sum_{i} (w_{2i} - w_{1i})^2 f_i(x) = \tau\, v_a^{\circ}(x). \tag{2.12} $$

We have introduced the variables $a^{\circ}$ and $v_a^{\circ}$ in order to make
explicit the dependency on the time $\tau$. Our diffusion process is thus
characterized by the mean $\bar{a}(x)$ and the variance $v_a(x)$: both depend on the
stimulus $x$, not only the mean as often assumed in the literature (see e.g. Huk and
Shadlen, 2005).

Section 3 will derive the mean time to reach one of the two bounds $+\gamma$ or
$-\gamma$ (i.e. the mean reaction time), and characterize the mean and variance of
the decision variable in terms of both posterior probabilities of the categories and
neural sensitivity of the coding layer.
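For orientation, the mean first-passage time of a diffusion between two symmetric absorbing bounds has a standard closed form, quoted here as a textbook result for a Wiener process with constant drift and variance rate; it is not claimed to be the exact expression derived in Section 3. The Python sketch below compares it with a direct simulation. The drift and variance per unit time play the role of the mean and variance of Eqs. (2.11)-(2.12) divided by the window duration, and all numerical values and function names are hypothetical.

```python
import numpy as np

def mean_first_passage_time(drift, var_rate, bound):
    """Mean exit time from (-bound, +bound) for a Wiener process started at 0."""
    if abs(drift) < 1e-12:
        return bound ** 2 / var_rate                 # zero-drift limit
    return (bound / drift) * np.tanh(drift * bound / var_rate)

def prob_upper(drift, var_rate, bound):
    """Probability of reaching +bound before -bound."""
    return 1.0 / (1.0 + np.exp(-2.0 * drift * bound / var_rate))

def simulate(drift, var_rate, bound, dt=1e-3, trials=2000, seed=0):
    """Euler simulation of all trials in parallel until each one is absorbed."""
    rng = np.random.default_rng(seed)
    a = np.zeros(trials)                             # decision variable
    t = np.zeros(trials)                             # elapsed time
    active = np.ones(trials, dtype=bool)
    while active.any():
        n = int(active.sum())
        a[active] += drift * dt + np.sqrt(var_rate * dt) * rng.normal(size=n)
        t[active] += dt
        active &= np.abs(a) < bound
    return float(t.mean()), float((a >= bound).mean())

drift, var_rate, bound = 1.0, 1.0, 1.0               # hypothetical values
t_theory = mean_first_passage_time(drift, var_rate, bound)
p_theory = prob_upper(drift, var_rate, bound)
t_sim, p_sim = simulate(drift, var_rate, bound)
print(t_theory, t_sim, p_theory, p_sim)
```

Because the time step is finite, the simulated exit times slightly overestimate the continuous-time value. The closed form also makes visible why a stimulus-dependent variance matters: the mean time depends on the drift and on the ratio of drift to variance, not on the drift alone.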
3 Results



This section develops and demonstrates the two main points of this paper, each of
them consisting of two steps: (1a) characterization of the properties of the optimal
decoder; this part, although lengthy and technical, is necessary in order to
understand how the efficiency of category identification (read-out accuracy measured
by the appropriate Cramér-Rao bound) is intrinsically linked to the efficiency of the
encoding stage (measured by an information content); the results of this part are a
crucial intermediate step upon which the analysis of the reaction times is built;
(1b) within our neural model, neural implementation of the optimal decoder; (2a)
characterization of the mean reaction times as a function of the stimulus, assuming
that the neural code has been optimized; and (2b) interpretation of microscopic
quantities (tuning curves of the neurons, synaptic weights) in terms of macroscopic
quantities (discrimination accuracy). The results are then illustrated by numerical
simulations and confronted with experimental data.

3.1 Optimal read-out: estimation of the posterior probabilities

3.1.1 Characterization of the optimal estimator

Given a neural code, we here characterize, independently of the particular
implementation of the decoding layer, the theoretical properties of the optimal
estimator of the posterior probabilities of a category knowing a stimulus.
One can easily show that the estimator minimizing the cost function (2.3) is

$$ g(\mu|\mathbf{r}) = P(\mu|\mathbf{r}). \tag{3.13} $$

One can expect the optimal estimator $P(\mu|\mathbf{r})$ to be unbiased and efficient
in the large $N$ limit. We show below that this is the case at leading order in
$1/N$. In doing so, we derive from the Cramér-Rao bound an optimal bound for our cost
function, and provide an explicit link between the Bayes and the information
theoretic approaches.

An unbiased estimator. Under the hypotheses presented above, one can show that
- up to a correction, hence a bias, of order $1/N$ that we will neglect in the
following - one has:

$$ \int d^N r\, P(\mu|\mathbf{r})\, P(\mathbf{r}|x) = P(\mu|x). \tag{3.14} $$

Note that, because of the processing chain $\mu \to x \to \mathbf{r}$, the left-hand
side of the above equation is not identically equal to $P(\mu|x)$ (to be convinced,
consider the zero signal-to-noise ratio case where the neural activity $\mathbf{r}$
does not depend on $x$).



Cramér-Rao bound. We here derive an optimal bound for the mean cost, Eq. (2.3).
Let us consider an unbiased estimate $g(\mu|\mathbf{r})$ of the posterior probability
$P(\mu|x)$ (that is, $\int d^N r\, g(\mu|\mathbf{r})\, P(\mathbf{r}|x) = P(\mu|x)$).
For such an estimate, the Cramér-Rao inequality reads (see e.g. Cover and Thomas,
2006, §11.10):

$$ \int d^N r\, P(\mathbf{r}|x) \left( g(\mu|\mathbf{r}) - P(\mu|x) \right)^2 \ge \frac{P'(\mu|x)^2}{F_{\mathrm{code}}(x)}, \tag{3.15} $$

where $P'(\mu|x) = \partial P(\mu|x)/\partial x$, and $F_{\mathrm{code}}(x)$ is the
Fisher information characterizing the sensitivity of $\mathbf{r}$ with respect to
small variations of $x$:

$$ F_{\mathrm{code}}(x) = -\int d^N r\, \frac{\partial^2 \ln P(\mathbf{r}|x)}{\partial x^2}\, P(\mathbf{r}|x). \tag{3.16} $$

Now we rewrite the mean cost induced by the estimation, Eq. (2.3), as:

$$ \bar{C} = -\int dx\, p(x) \int d^N r\, P(\mathbf{r}|x) \sum_{\mu} P(\mu|x) \ln \frac{g(\mu|\mathbf{r})}{P(\mu|x)}. \tag{3.17} $$

For the typical values of $\mathbf{r}$ given a stimulus $x$, $g(\mu|\mathbf{r})$ has
to be close to $P(\mu|x)$, so that:

$$ \ln \frac{g(\mu|\mathbf{r})}{P(\mu|x)} = \ln\left( 1 + \frac{g(\mu|\mathbf{r}) - P(\mu|x)}{P(\mu|x)} \right) \simeq \frac{g(\mu|\mathbf{r}) - P(\mu|x)}{P(\mu|x)} - \frac{1}{2} \frac{\left( g(\mu|\mathbf{r}) - P(\mu|x) \right)^2}{P(\mu|x)^2}. \tag{3.18} $$

Substituting this expansion into (3.17) and using the fact that $g(\mu|\mathbf{r})$
is an unbiased estimate of $P(\mu|x)$ (so that the first-order term averages to
zero), we can write

$$ \bar{C} = \frac{1}{2} \int dx\, p(x) \sum_{\mu} \frac{1}{P(\mu|x)} \int d^N r\, P(\mathbf{r}|x) \left( g(\mu|\mathbf{r}) - P(\mu|x) \right)^2. \tag{3.19} $$

Hence, making use of the Cramér-Rao inequality (3.15), we get that the mean cost
satisfies:

$$ \bar{C} \ge \frac{1}{2} \int dx\, p(x)\, \frac{F_{\mathrm{cat}}(x)}{F_{\mathrm{code}}(x)}, \tag{3.20} $$

where $F_{\mathrm{code}}(x)$ is the Fisher information (3.16), and
$F_{\mathrm{cat}}(x)$ is the Fisher information that characterizes categorization
uncertainty (which will be henceforth called the category-related Fisher
information):

$$ F_{\mathrm{cat}}(x) = -\sum_{\mu} \frac{\partial^2 \ln P(\mu|x)}{\partial x^2}\, P(\mu|x) = \sum_{\mu} \frac{P'(\mu|x)^2}{P(\mu|x)}. \tag{3.21} $$

Note that $F_{\mathrm{code}}$ is of order $N$ and $F_{\mathrm{cat}}$ of order
$N^0 = 1$, so that the bound is of order $1/N$. Moreover, if the estimator has a bias
of order $1/N$ (as is the case below considering $P(\mu|\mathbf{r})$), one can show
that the contribution of this bias to the Cramér-Rao bound is of order $1/N^2$, so
that Eq. (3.20) remains valid.
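To make the two Fisher information terms concrete, the following Python sketch evaluates them on a grid for a toy case: two equiprobable Gaussian categories and a population of independent Poisson neurons, for which the code Fisher information takes the standard form of the window duration times the sum over neurons of the squared tuning-curve slope divided by the rate. The category shapes, tuning curves and all numerical values are assumptions chosen for illustration.

```python
import numpy as np

x = np.linspace(-4.0, 4.0, 2001)
dx = x[1] - x[0]

def gauss(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Two equiprobable Gaussian categories (hypothetical example)
q = np.array([0.5, 0.5])
px_mu = np.stack([gauss(x, -1.0, 1.0), gauss(x, 1.0, 1.0)])   # P(x|mu), shape (2, G)
p_x = q @ px_mu                                               # p(x)
post = q[:, None] * px_mu / p_x                               # P(mu|x)

# Category-related Fisher information: sum_mu P'(mu|x)^2 / P(mu|x)
dpost = np.gradient(post, x, axis=1)
F_cat = np.sum(dpost ** 2 / post, axis=0)

# Code Fisher information for independent Poisson neurons with rates f_i(x)
N, tau = 100, 0.5
centers = np.linspace(-5.0, 5.0, N)
f = 1.0 + 30.0 * np.exp(-0.5 * ((x[None, :] - centers[:, None]) / 0.4) ** 2)
df = np.gradient(f, x, axis=1)
F_code = tau * np.sum(df ** 2 / f, axis=0)

# Lower bound on the mean cost: (1/2) * integral of p(x) * F_cat(x) / F_code(x)
cost_bound = 0.5 * np.sum(p_x * F_cat / F_code) * dx
print(F_cat.max(), cost_bound)
```

In this toy case the category-related term peaks at the boundary between the two categories, so the integrand of the bound is largest there unless the code Fisher information is also large at the boundary.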



An efficient estimator. If we now replace $g(\mu|\mathbf{r})$ in Eq. (2.3) by its
optimal value $P(\mu|\mathbf{r})$, we get an interesting expression of the cost at
the optimum, which is a difference between two mutual information values. Indeed, one
can write $\bar{C} = \mathcal{H}(\mu|\mathbf{r}) - \mathcal{H}(\mu|x)$, that is

$$ \bar{C} = I(\mu, x) - I(\mu, \mathbf{r}), \tag{3.22} $$

where $I(\mu, x)$ is the mutual information between the categories $\mu$ and the
stimulus $x$,

$$ I(\mu, x) = \sum_{\mu=1}^{M} q_\mu \int dx\, P(x|\mu) \ln \frac{P(x|\mu)}{p(x)}, \tag{3.23} $$

and $I(\mu, \mathbf{r})$ the mutual information between $\mu$ and the neural
activity $\mathbf{r}$,

$$ I(\mu, \mathbf{r}) = \sum_{\mu=1}^{M} q_\mu \int d^N r\, P(\mathbf{r}|\mu) \ln \frac{P(\mathbf{r}|\mu)}{P(\mathbf{r})}. \tag{3.24} $$

In Bonnasse-Gahot and Nadal (2008), we have shown that, in the large signal-to-noise
ratio limit which we consider here, the difference $I(\mu, x) - I(\mu, \mathbf{r})$
which appears in the above equation (3.22) is given by:

$$ I(\mu, x) - I(\mu, \mathbf{r}) = \frac{1}{2} \int dx\, p(x)\, \frac{F_{\mathrm{cat}}(x)}{F_{\mathrm{code}}(x)}, \tag{3.25} $$

that is precisely by the right-hand side of the inequality (3.20). Hence, for the
estimator $P(\mu|\mathbf{r})$, this inequality is an equality, which means that the
Cramér-Rao bound is saturated. The probability distribution
$\{P(\mu|\mathbf{r}),\ \mu = 1, \ldots, M\}$ is thus an estimator of $P(\mu|x)$ that
is (asymptotically) unbiased and (asymptotically) efficient.
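The identity between the cost at the optimum and a difference of two mutual information values, Eq. (3.22), holds exactly for any processing chain from category to stimulus to response. It can be verified on a small discrete toy model, as in the Python sketch below. The discrete stimulus and response alphabets, their sizes, and the random probability tables are assumptions made only to allow an exact numerical check.

```python
import numpy as np

rng = np.random.default_rng(0)
M, X, R = 2, 6, 8            # numbers of categories, stimuli, responses (toy sizes)

q = np.array([0.4, 0.6])                               # q_mu
P_x_given_mu = rng.dirichlet(np.ones(X), size=M)       # P(x|mu), shape (M, X)
P_r_given_x = rng.dirichlet(np.ones(R), size=X)        # P(r|x), shape (X, R)

joint_mu_x = q[:, None] * P_x_given_mu                 # P(mu, x)
p_x = joint_mu_x.sum(axis=0)
joint_mu_r = joint_mu_x @ P_r_given_x                  # P(mu, r), chain mu -> x -> r
p_r = joint_mu_r.sum(axis=0)

def mutual_info(joint):
    """Mutual information (in nats) of a strictly positive joint table."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    return float(np.sum(joint * np.log(joint / (pa * pb))))

post_x = joint_mu_x / p_x                              # P(mu|x)
post_r = joint_mu_r / p_r                              # P(mu|r), the optimal estimator

# Mean cost with g(mu|r) = P(mu|r):
# sum_x p(x) sum_r P(r|x) sum_mu P(mu|x) ln[P(mu|x) / P(mu|r)]
log_ratio = np.log(post_x[:, :, None] / post_r[:, None, :])    # shape (M, X, R)
cost = float(np.einsum('x,xr,mx,mxr->', p_x, P_r_given_x, post_x, log_ratio))

info_loss = mutual_info(joint_mu_x) - mutual_info(joint_mu_r)
print(cost, info_loss)
```

The two printed numbers coincide up to rounding, which illustrates that the cost of the best read-out is the information about the category lost between the stimulus and the neural response.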



Information theoretic viewpoint on Bayesian inference. In passing, we have thus
shown that the decoding cost, for the optimal estimator, is directly related to the
mutual information between the categories and the neural code. This result is in
agreement with previous results in the field of statistical inference based on an
information theoretic approach to Bayesian inference and neural coding (Clarke and
Barron, 1990; Haussler and Opper, 1995; Rissanen, 1996; Herschkowitz and Nadal, 1999;
Bialek et al., 2001): in words, the best estimator cannot do better than extracting
the information that is conveyed by the available data/observations (here the
neural activity) about the unknown parameter/stimulus (here the category). As
a consequence, optimizing the code by maximizing the mutual information is also
necessary in order to optimize decoding.

In addition, we have also obtained that the asymptotic expression (3.25) of the
mutual information, derived in Bonnasse-Gahot and Nadal (2008), has a natural
interpretation since it comes from the Cramér-Rao bound. This is to be related to,
and contrasted with, the case of the coding (or estimation) of a continuous stimulus
(or parameter) (Clarke and Barron, 1990; Rissanen, 1996; Brunel and Nadal, 1998). If
the aim of the considered neural system is to encode a continuous parameter $x$,
e.g. an orientation, in the large signal-to-noise ratio limit the mutual information
(between the neural code and the parameter) is essentially given by the logarithm
of the Fisher information, $F_{\mathrm{code}}(x)$, or more exactly by the logarithm
of the bound of the Cramér-Rao inequality (Brunel and Nadal, 1998). Here one gets
that the mutual information (between the neural code and the category) is also
expressed in terms of the bound of the Cramér-Rao inequality, this bound being
written for the estimation of the probability of the category (not of the category
itself).



It follows from the previous results that an optimal strategy for the neural system
consists in (1) applying the 'infomax' principle to the coding layer; (2) building a
decoding layer with M output cells such that, from the neural activity $\mathbf{r}$ of the
coding layer, the $\mu$-th output cell has its activity precisely equal to the conditional
probability $P(\mu|\mathbf{r})$. One should note, however, that optimization of the decoding
layer may be done for a given, not necessarily optimized, coding layer. It might be
the case that the coding layer is used for different related tasks, and/or that the time
scale for adaptation of the encoding is large, ensuring some long term stability or
robustness despite the need to face various temporary tasks. In the case of linguistic
data to be analyzed later, the analysis will be consistent with the assumption that
native speakers of a language have a well adapted neural representation of their
phonetic categories, whereas non-native speakers do not.

3.1.2 Network optimization

In this section we consider the optimization of the decoding layer through a learning
procedure, this being done for a given coding layer (not necessarily optimized).
The optimal estimator is searched for within a class of probability distributions
$g(\cdot|\mathbf{r}, \mathbf{w}) = \{g(\mu|\mathbf{r}, \mathbf{w}), \mu = 1, ..., M\}$ that the neural system can implement, $\mathbf{w}$
denoting the set of adaptable parameters (e.g. synaptic weights).

For an optimally adapted neural network, we thus expect $g(\mu|\mathbf{r}, \mathbf{w})$ to have a
distribution with mean $P(\mu|x)$ and with variance saturating the Cramér-Rao bound:

$$ \bar{g}_\mu = P(\mu|x) \qquad (3.26) $$
$$ v(g_\mu) = \frac{\left(P'(\mu|x)\right)^2}{F_{\mathrm{code}}(x)} \qquad (3.27) $$

For the considered neural model, we have seen that, with $\bar{g}_\mu$ of order $1 = (N\tau)^0$, for
consistency one must have $v(g_\mu)$ of order $1/(N\tau)$. Since here $F_{\mathrm{code}}(x)$ reads

$$ F_{\mathrm{code}}(x) = \tau \sum_{i=1}^{N} \frac{f_i'(x)^2}{f_i(x)} \qquad (3.28) $$

this Fisher information is of order $N\tau$, hence the optimal variance given by (3.27)
is indeed of order $1/(N\tau)$.

As for the weights $\mathbf{w}$, although deriving general results sounds difficult, we expect
the weights to be greater the further away the corresponding cell is from the category
boundary: a cell's 'vote' should indeed be more important if it is more confident. One
way to see that is to consider from Eqs. (2.6) and (3.26) that the quantity
$\tau \sum_i w_{\mu i} f_i(x)$ behaves, up to a constant, as $\ln P(\mu|x)$; in the limit case of a
continuum of cells with Dirac delta function tuning curves, the weight function $w_\mu(x)$
is thus also proportional to the log of the posterior probability $P(\mu|x)$. In the
following numerical illustration, the weights are indeed found to be greater within a
category than between categories.

One may ask whether the chosen neural architecture allows one to approximate
efficiently the optimal solution. Actually, general results on function approximation
show that a single 'hidden layer' (here the coding layer) is enough to approximate
any smooth enough function with an accuracy which can be as good as wanted, given a
large enough number (here N) of 'hidden units' (coding cells). In addition, making use
of a very large number of coding/hidden units is in line with the support vector
machine (SVM) approach, which can be understood as projecting the inputs onto a
high-dimensional space, in which categorization becomes an easy task. It is likely
that many different learning algorithms, supervised or unsupervised, may be able to
achieve the optimal solution. For illustrative purposes, in the following numerical
simulations we will make use of a particular supervised learning strategy.



3.1.3 Illustration on two categories 



In this section, we illustrate our theory on the simplest example, that of two
Gaussian categories. Recall that x represents the relevant (continuous) physical space
in which the stimulus lies. In the case of vowels, one may think of the space of
formants. For comparison with specific empirical data, one may take as proxy for
x the 1-dimensional control parameter used in an experiment to make the stimulus
change continuously from one category to the other. For instance, in a face
identification experiment, this dimension is defined by the morphed continuum between
two different faces (see e.g. Beale and Keil, 1995). In the psycholinguistic study by
Ylinen et al. (2005), which will be studied in more depth in the following section, the
control parameter is the vocalic duration. To fix ideas, consider the experimental
study of McMurray and Spivey (2000). In this experiment, subjects are presented
with a continuum of 9 stimuli, ranging from category /ba/ to category /pa/, and
whose voice onset time (VOT) values vary from $x_1 = -50$ ms to $x_9 = 60$ ms. The
task is to identify the category by clicking the corresponding button on a screen.
Using an eye-tracking method, this behavioral study measures the time spent by
subjects looking at the two buttons after hearing a given stimulus. Here we can
consider the VOT as the relevant x-space.

We assumed the two categories to be equiprobable, each one characterized
by a Gaussian distribution, centered at $x_{\mu_1} = -2$ and $x_{\mu_2} = 2$, with a width
$\sigma_{\mu_1} = \sigma_{\mu_2} = 1.5$. These numbers are arbitrary and chosen for illustrative purposes
only. To compare orders of magnitude with those of the experiment described
above, one unit of the x space in the simulation corresponds to a difference in
VOT of 13.75 ms (the spacing between two consecutive stimuli), with the categories
centered at $x_{\mu_1} = -22.5$ ms and $x_{\mu_2} = 32.5$ ms, and a width
$\sigma_{\mu_1} = \sigma_{\mu_2} = 20.6$ ms. We considered a neuronal population with N = 14 coding
cells. The activity $r_i$ of each neuron is given by Poisson statistics with mean firing
rate $f_i(x)$, corresponding to a bell-shaped tuning curve:

$$ f_i(x) = f_{\min} + (f_{\max} - f_{\min}) \exp\left(-\frac{(x - x_i)^2}{2 a^2}\right) \qquad (3.29) $$

The preferred stimuli $x_i$ of the cells are equidistributed over the domain [-6, 6] (which
corresponds to VOTs in the range [-77.5 ms, 87.5 ms]). The width and the minimal
and maximal values of the tuning curves are the same for all the neurons ($a = 1.38$
(≈ 19 ms), $f_{\min} = 0.001$ and $f_{\max} = 5$).
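
The coding layer described above can be sketched numerically as follows. This is a minimal illustration, assuming Gaussian-shaped tuning curves with a common width and independent Poisson cells; the function names and the time window passed to `sample_activity` are hypothetical and not taken from the original simulations.

```python
import numpy as np

# Parameters quoted in the text (x in simulation units; 1 unit = 13.75 ms of VOT).
N_CELLS = 14
F_MIN, F_MAX = 0.001, 5.0
WIDTH = 1.38
PREFERRED = np.linspace(-6.0, 6.0, N_CELLS)  # equidistributed preferred stimuli


def tuning_curves(x):
    """Mean firing rates f_i(x), shape (len(x), N_CELLS). Gaussian shape is an assumption."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
    bump = np.exp(-(x - PREFERRED) ** 2 / (2.0 * WIDTH ** 2))
    return F_MIN + (F_MAX - F_MIN) * bump


def fisher_rate(x):
    """Fisher information rate sum_i f_i'(x)^2 / f_i(x) for independent Poisson cells."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    f = tuning_curves(x)
    # Analytical derivative of the Gaussian bump with respect to x.
    df = -(f - F_MIN) * (x[:, None] - PREFERRED) / WIDTH ** 2
    return np.sum(df ** 2 / f, axis=1)


def sample_activity(x, tau, rng):
    """Spike counts r_i ~ Poisson(tau * f_i(x)) for a single stimulus x."""
    return rng.poisson(tau * tuning_curves(x)[0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print("spike counts at x = 0:", sample_activity(0.0, tau=10.0, rng=rng))
    print("Fisher rate at x = -2, 0, 2:", fisher_rate(np.array([-2.0, 0.0, 2.0])))
```

Multiplying `fisher_rate(x)` by the length of the observation window gives the Fisher information of the population over that window, under the same assumptions.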

We ran a supervised learning phase in which a large number of stimuli x are
presented to the network along with their category label. Following each presentation,
the parameters $\mathbf{w}$ are updated in order to minimize the training cost function

$$ C_t(x, \mathbf{r}) = -\sum_\mu t_\mu(x) \ln g(\mu|\mathbf{r}, \mathbf{w}) \qquad (3.30) $$

where x is the presented stimulus, and the 'teacher value' $t_\mu(x)$ is 1 if the correct
category is $\mu$, and 0 otherwise. As shown in the Supporting Information, through
averaging over the presentation of a large number of stimuli, this cost becomes
identical to the relative entropy between the true posterior probabilities and the output
$g(\mu|\mathbf{r}, \mathbf{w})$ (Eq. (2.3)). Looking at the histogram of the values of $g(\mu|\mathbf{r}, \mathbf{w})$ (following
learning) for different realizations of the activity $\mathbf{r}$ evoked by a given stimulus x (see
Fig. 2), we can notice the close proximity with the optimal theoretical curve given
by the normal distribution centered at $P(\mu|x)$ and with variance $P'(\mu|x)^2 / F_{\mathrm{code}}(x)$.
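
A supervised learning phase of this kind can be sketched as follows. The text does not specify the learning rule, so this example assumes a softmax read-out trained by plain batch gradient descent on the cross-entropy between teacher values and outputs; the time window `TAU`, the rescaling of spike counts, the learning rate and the number of steps are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Values quoted in the text, except TAU which is an assumed time window.
N_CELLS, F_MIN, F_MAX, WIDTH, TAU = 14, 0.001, 5.0, 1.38, 10.0
PREFERRED = np.linspace(-6.0, 6.0, N_CELLS)
CENTERS, SIGMA = np.array([-2.0, 2.0]), 1.5


def rates(x):
    """Assumed Gaussian tuning curves f_i(x), shape (len(x), N_CELLS)."""
    bump = np.exp(-(x[:, None] - PREFERRED) ** 2 / (2.0 * WIDTH ** 2))
    return F_MIN + (F_MAX - F_MIN) * bump


def make_dataset(n):
    """Draw labels, stimuli and Poisson spike counts (rescaled to be of order 1)."""
    labels = rng.integers(0, 2, size=n)
    x = rng.normal(CENTERS[labels], SIGMA)
    r = rng.poisson(TAU * rates(x)) / (TAU * F_MAX)
    return r, labels


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train(r, labels, n_steps=2000, lr=0.5):
    """Batch gradient descent on the cross-entropy -sum_mu t_mu ln g(mu | r, w)."""
    n, m = r.shape[0], 2
    w, b = np.zeros((r.shape[1], m)), np.zeros(m)
    t = np.eye(m)[labels]  # one-hot 'teacher values'
    for _ in range(n_steps):
        g = softmax(r @ w + b)
        w -= lr * r.T @ (g - t) / n       # gradient of the mean cross-entropy w.r.t. w
        b -= lr * (g - t).mean(axis=0)    # gradient w.r.t. the biases
    return w, b


r_train, y_train = make_dataset(5000)
w, b = train(r_train, y_train)
r_test, y_test = make_dataset(5000)
g_test = softmax(r_test @ w + b)
accuracy = np.mean(g_test.argmax(axis=1) == y_test)
print(f"test accuracy: {accuracy:.3f}")
```

The columns of `g_test` play the role of the outputs $g(\mu|\mathbf{r}, \mathbf{w})$; a histogram of one column over repeated presentations of a fixed stimulus is the kind of quantity compared with the theoretical normal distribution.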

The temporal evolution of the output of the network reflects the accumulation of
the categorical information extracted from the neuronal activity. The learning phase
was performed on a time window $[0, \tau_a]$ so that $\tau_a f_{\max}$ represents the mean number
of spikes emitted by cell i during this time interval when the stimulus corresponds to
its preferred stimulus. One can then look at the output $g(\mu|\mathbf{r}, \mathbf{w})$ for different values
of $\tau \in [0, \tau_a]$. Averaging over different realizations of this activity (1000 realizations
in this numerical example), we finally get an estimate of the average value taken
by the output $g(\mu|\mathbf{r}, \mathbf{w})$ for each interval $[0, \tau]$. Figure 3 (Left) shows the temporal
evolution of the mean values of the output $g(\mu|\mathbf{r}, \mathbf{w})$ for different stimuli along the
continuum $x_1 = -50$ ms, ..., $x_9 = 60$ ms (the curves getting redder and darker as
$\tau$ increases). For comparison, Figure 3 (Right) shows the results from the
above-mentioned experimental study of McMurray and Spivey (2000): one sees a gradual
increase of categorical information, characterized by a sigmoid that expands over
time, in qualitative agreement with our model.

Figure 2: Comparison between theoretical and numerical distribution of the posterior
probability estimator. Histogram of the values of $g(\mu|\mathbf{r}, \mathbf{w})$ ($\mu = 1$) for 1000 realizations
of the neural activity $\mathbf{r}$ evoked by a stimulus close to the boundary between the two
categories. In red, the theoretical curve: a normal distribution centered at $P(\mu|x)$ and
with variance $P'(\mu|x)^2 / F_{\mathrm{code}}(x)$, predicting the values taken by an unbiased and efficient
output.

Figure 3: Qualitative comparison of the temporal evolution of the decision between the
model and as found in experimental data. (Left) Averaged temporal evolution of
$g(\mu = 2|\mathbf{r}, \mathbf{w})$ along the continuum $x_1, ..., x_9$. The increase in the length of the time
window $[0, \tau]$ is indicated by a color gradient ranging from orange to dark red. (Right)
Evolution of the proportion of looking time to the category /pa/ vs the category /ba/ for
different stimuli whose voice onset time (VOT) values vary from $x_1 = -50$ ms to
$x_9 = 60$ ms (data extracted from McMurray and Spivey, 2000).

3.2 Reaction times



This section analytically characterizes the mean reaction time following the
identification of a category as a function of the stimulus presented, and shows how the
analysis of the previous sections allows one to specify, and understand the origin of, the
parameters of the diffusion model.



3.2.1 Mean reaction times 

We want to express, in terms of the threshold $\gamma$ and of the mean and variance of the
diffusion process, the mean time $\bar{\tau}_d(x)$ to reach one of the two bounds. To do so we
can apply to our model the general results on first passage times (Wald, 1947; Link,
1992). Applications of the theory of first passage time in the field of neuropsychology
are presented in Shadlen et al. (2006). The essential difference with these works is
here the dependency of the variance on the stimulus. The general theory on first
passage time applied to our framework leads to the following equation:

$$ \bar{\tau}_d(x) = \frac{\gamma^2}{v_a^0(x)}\, \Phi_d\!\left(\frac{\gamma\, \bar{a}^0(x)}{v_a^0(x)}\right) \qquad (3.31) $$

where

$$ \Phi_d(y) = \frac{1}{y} \tanh(y) \qquad (3.32) $$

with $\Phi_d(0) = \lim_{y \to 0} \Phi_d(y) = 1$. One can get some insight into the nature of this
formula by considering an approximation which, although based on a two-line
argument, gives surprisingly good results. First, to get rid of the sign, we consider the
square $a_\tau(\mathbf{r})^2$ of the decision variable. For a given time window $[0, \tau]$, we average
this quantity over the realizations of the neuronal activity given a stimulus x. We
then define (an approximation of) the mean reaction time by the value of $\tau$ such
that the average of $a_\tau(\mathbf{r})^2$ is equal to the square of the bound $\gamma$. In other words,
we write

$$ \left\langle a_{\bar{\tau}_d}(\mathbf{r})^2 \right\rangle_x = \gamma^2 \qquad (3.33) $$

where $\langle \cdot \rangle_x$ indicates the integration over $\mathbf{r}$ given x. Given the mean and variance,
Eqs. (2.11) and (2.12), one gets a second-degree equation for $\bar{\tau}_d$, that is:
$\bar{\tau}_d^2\, (\bar{a}^0(x))^2 + \bar{\tau}_d\, v_a^0(x) - \gamma^2 = 0$. The positive root of this equation gives $\bar{\tau}_d$:

$$ \bar{\tau}_d(x) = \frac{\gamma^2}{v_a^0(x)}\, \Phi_a\!\left(\frac{\gamma\, \bar{a}^0(x)}{v_a^0(x)}\right) \qquad (3.34) $$

where

$$ \Phi_a(y) = \frac{\sqrt{1 + 4y^2} - 1}{2y^2} \qquad (3.35) $$

with $\Phi_a(0) = \lim_{y \to 0} \Phi_a(y) = 1$. Clearly the expressions (3.34) and (3.31) have
the same structure. Despite the apparent dissimilarity between $\Phi_a$ and $\Phi_d$, these
two functions have the same qualitative behavior as functions of their argument
y, sharing the same asymptotic limits for both small and large values of y: both
expressions for the mean reaction time give, for $|\bar{a}^0(x)| \gg v_a^0(x)/\gamma$,
$\bar{\tau}_d(x) \sim \gamma / |\bar{a}^0(x)|$, and for $|\bar{a}^0(x)| \ll v_a^0(x)/\gamma$, $\bar{\tau}_d(x) \sim \gamma^2 / v_a^0(x)$. Note that
the similarity between our expression (3.34) and the exact one (3.31) is remarkable
since, in our argument, the notion of first passage is not even used.
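
The comparison between the exact and the approximate expressions can be checked numerically. The sketch below assumes the forms $\Phi_d(y) = \tanh(y)/y$ and $\Phi_a(y) = (\sqrt{1+4y^2} - 1)/(2y^2)$, the latter being written in an algebraically equivalent form that avoids cancellation errors at small y; the function names and the numerical values of bias, variance and threshold are illustrative only.

```python
import numpy as np


def phi_exact(y):
    """Phi_d(y) = tanh(y) / y, with the limit value 1 at y = 0."""
    y = np.asarray(y, dtype=float)
    safe = np.where(y == 0.0, 1.0, y)  # placeholder value to avoid dividing by zero
    return np.where(y == 0.0, 1.0, np.tanh(safe) / safe)


def phi_approx(y):
    """Phi_a(y) = (sqrt(1 + 4 y^2) - 1) / (2 y^2), rewritten as 2 / (1 + sqrt(1 + 4 y^2))."""
    y = np.asarray(y, dtype=float)
    return 2.0 / (1.0 + np.sqrt(1.0 + 4.0 * y ** 2))


def mean_decision_time(bias, variance, gamma, phi=phi_exact):
    """Mean decision time (gamma^2 / v) * Phi(gamma * a / v)."""
    return gamma ** 2 / variance * phi(gamma * bias / variance)


if __name__ == "__main__":
    # Illustrative values: fixed variance and threshold, bias swept from 0 upwards.
    biases = np.array([0.0, 0.1, 0.5, 1.0, 5.0])
    exact = mean_decision_time(biases, variance=0.5, gamma=1.0, phi=phi_exact)
    approx = mean_decision_time(biases, variance=0.5, gamma=1.0, phi=phi_approx)
    for a, te, ta in zip(biases, exact, approx):
        print(f"bias={a:4.1f}  exact={te:.4f}  approx={ta:.4f}")
```

Both functions equal 1 at y = 0 and decay as 1/y for large y, which is the shared asymptotic behavior referred to in the text.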

3.2.2 Macro interpretation of micro quantities 

We have seen that the mean and variance of the diffusion process result from the 
aggregation of information from the very large assembly of neurons in the coding 
layer. We now want to make use of our analysis on the optimal network, done in 
the previous sections, in order to give the expression of these mean and variance in 
terms of macroscopic quantities. 

As we have shown, for the large N limit considered here, the activity of the first
output unit, $g_\tau(1|\mathbf{r}, \mathbf{w})$, is characterized by a Gaussian distribution. The mean
$\bar{g}_{1,\tau}$ and variance $v(g_{1,\tau})$ of this distribution can be easily determined:

$$ \bar{g}_{1,\tau} = \frac{1}{1 + \exp(\tau\, \bar{a}^0)} \qquad (3.36) $$

and $v(g_{1,\tau}) = \bar{g}_{1,\tau}^2\, (1 - \bar{g}_{1,\tau})^2\, \tau\, v_a^0$, which can be rewritten as

$$ v(g_{1,\tau}) = \frac{(\bar{g}'_{1,\tau})^2}{\tau\, (\bar{a}^{0\prime})^2}\, v_a^0 \qquad (3.37) $$

where ' denotes the derivative with respect to x, and we recall that $\bar{a}^0$ and $v_a^0$ are
the mean and variance of the diffusion process.

Now we have also just seen, Eq. (3.27), that for the optimized network the mean
$\bar{g}_{1,\tau_a}$ and the variance $v(g_{1,\tau_a})$ are given by

$$ \bar{g}_{1,\tau_a} = P(1|x) \qquad (3.38) $$
$$ v(g_{1,\tau_a}) = \frac{P'(1|x)^2}{\tau_a\, F^0_{\mathrm{code}}(x)} \qquad (3.39) $$

where $\tau_a$ is the integration time used during the learning phase, and $F^0_{\mathrm{code}}(x)$ is
the Fisher information rate specific to the neural code, so that if we observe the
neural activity during a time window $[0, \tau]$, the Fisher information of the neuronal
population writes as $F_{\mathrm{code}}(x) = \tau F^0_{\mathrm{code}}(x)$. For the present neural model, this Fisher
information rate is given in terms of the tuning curves by

$$ F^0_{\mathrm{code}}(x) = \sum_i \frac{f_i'(x)^2}{f_i(x)} \qquad (3.40) $$

Making use of equations (3.36) and (3.37), we then get the mean and variance
of the diffusion process in terms of macro quantities:

$$ \bar{a}^0(x) = \frac{1}{\tau_a} \ln \frac{P(2|x)}{P(1|x)} \qquad (3.41) $$
$$ v_a^0(x) = \frac{\left(\bar{a}^{0\prime}(x)\right)^2}{F^0_{\mathrm{code}}(x)} \qquad (3.42) $$

Given the expression of the bias (3.41), one can also write the variance as

$$ v_a^0(x) = \frac{1}{\tau_a^2\, P(1|x)\, P(2|x)}\, \frac{F_{\mathrm{cat}}(x)}{F^0_{\mathrm{code}}(x)} \qquad (3.43) $$

where $F_{\mathrm{cat}}$ is the category-related Fisher information, Eq. (3.21). Note that in
accordance with the previous analyses, $\bar{a}^0(x)$ is of order $1/\tau_a$ and $v_a^0(x)$ is of order $1/\tau_a^2$.
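
The two ways of writing the variance of the diffusion process can be compared numerically. The sketch below assumes two equiprobable Gaussian categories, a bias equal to $(1/\tau_a) \ln[P(2|x)/P(1|x)]$ (the sign convention is an assumption), a category-related Fisher information $F_{\mathrm{cat}}(x) = \sum_\mu P'(\mu|x)^2 / P(\mu|x)$, and a constant Fisher information rate for the code; all names and numerical values are illustrative.

```python
import numpy as np

CENTERS, SIGMA = np.array([-2.0, 2.0]), 1.5  # two equiprobable Gaussian categories
TAU_A = 1.0                                  # assumed learning time window


def posterior_1(x):
    """P(1 | x) for two equiprobable Gaussian categories with equal width."""
    # ln[P(2|x) / P(1|x)] from the ratio of the two Gaussian likelihoods.
    log_ratio = ((x - CENTERS[0]) ** 2 - (x - CENTERS[1]) ** 2) / (2.0 * SIGMA ** 2)
    return 1.0 / (1.0 + np.exp(log_ratio))


def bias(x):
    """Diffusion bias (1 / tau_a) ln[P(2|x) / P(1|x)] (sign convention assumed)."""
    p1 = posterior_1(x)
    return np.log((1.0 - p1) / p1) / TAU_A


def category_fisher(x, h=1e-5):
    """F_cat(x) = sum_mu P'(mu|x)^2 / P(mu|x), using a centred finite difference."""
    p1 = posterior_1(x)
    dp1 = (posterior_1(x + h) - posterior_1(x - h)) / (2.0 * h)
    return dp1 ** 2 / p1 + dp1 ** 2 / (1.0 - p1)


def variance_from_bias(x, f_code_rate, h=1e-5):
    """v = (d bias / dx)^2 / F0_code(x)."""
    dbias = (bias(x + h) - bias(x - h)) / (2.0 * h)
    return dbias ** 2 / f_code_rate


def variance_from_fisher(x, f_code_rate):
    """v = F_cat(x) / (tau_a^2 P(1|x) P(2|x) F0_code(x))."""
    p1 = posterior_1(x)
    return category_fisher(x) / (TAU_A ** 2 * p1 * (1.0 - p1) * f_code_rate)


if __name__ == "__main__":
    x = np.linspace(-3.0, 3.0, 7)
    print(variance_from_bias(x, f_code_rate=10.0))
    print(variance_from_fisher(x, f_code_rate=10.0))
```

Under these assumptions the two printed arrays agree up to finite-difference error, and the bias vanishes at the category boundary x = 0.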



This analysis gives one of the main results of the present paper. It makes it
possible to better understand the respective roles of the mean $\bar{a}^0(x)$ and the
variance $v_a^0(x)$ in the decision process. For a given stimulus x, the diffusion bias $\bar{a}^0(x)$
determines the mean direction taken by the decision variable towards one of the
two bounds. This bias is given by the log-likelihood ratio favoring one hypothesis
over another. This is in agreement with previous works on the Bayesian approach
to decision making (see e.g. Gold and Shadlen, 2007), but note that here this is a
result of the network optimization. Within a category, $\bar{a}^0(x)$, either negative or
positive depending on the category, is characterized by a large absolute value, which
rapidly leads the decision variable to the correct corresponding bound. Conversely, at
the boundary between categories, $\bar{a}^0(x)$ is zero: the trajectory of the decision variable
is then an unbiased random walk. The quantity $v_a^0(x)$ determines the amplitude of
the randomness in the trajectory of the decision variable. It is proportional to the
ratio of the category-related Fisher information to the coding Fisher information.
Recall that these Fisher information values give the sensitivity to small variations
in the stimulus of, respectively, the category and the neuronal population.



Application to Gaussian categories. If the categories are defined by Gaussian
distributions with the same variance, the quantity $\bar{a}^0(x)$ is linear in x:

$$ \bar{a}^0(x) = b_0\, (x - x_f) \qquad (3.44) $$

where $b_0$ is a scalar, and $x_f$ represents the boundary between categories, defined as
$P(1|x_f) = P(2|x_f)$. In this case, $v_a^0(x)$ simply writes:

$$ v_a^0(x) = \frac{b_0^2}{F^0_{\mathrm{code}}(x)} \qquad (3.45) $$

Introducing the parameter $\beta = \gamma / b_0$, the mean reaction time takes a simpler
expression:

$$ \bar{\tau}_d(x) = \beta^2\, F^0_{\mathrm{code}}(x)\; \Phi\!\left(\beta\, F^0_{\mathrm{code}}(x)\, (x - x_f)\right) \qquad (3.46) $$

where $\Phi = \Phi_d$, Eq. (3.32), for the exact expression (3.31), and $\Phi = \Phi_a$, Eq. (3.35),
in the case of our approximation (3.34).

One can notice that Eqs. (3.31) and (3.46) are (obviously) similar to those derived
in previous models based on diffusion models. Notably, Eq. (3.46), corresponding
to Gaussian categories that lead to linear decision bounds, is the same as in Ashby
(2000) (for identical absolute values of the negative and positive thresholds, and in
the absence of 'criterial noise', that is, noise on the decision boundary). The key
difference is in the interpretation of the parameters, here derived from the hypothesis
of optimal decoding. In particular, $F^0_{\mathrm{code}}(x)$ is interpreted both in terms of the
discriminability measured in psychophysics, and in terms of the neural sensitivity,
hence subject to adaptation. In Ashby (2000), in place of the Fisher information,
the parameter which appears in the formula also characterizes the variance in the
perception of the stimulus, but its characteristics are assumed independent of the
categorization task. In addition, in our result (3.46), the constant $b_0$ is analytically
determined, in particular in terms of the posterior probabilities $P(\mu|x)$. This makes
it possible to better predict or analyze the behavior of reaction times as a function
of the structure of the categories. For instance, considering two Gaussian categories
with equal variance, increasing the variance of these distributions, which amounts
to increasing categorization uncertainty, results in longer reaction times, in a way that
is quantified by our formula.
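
The effect of the category variance on reaction times can be illustrated with a short numerical sketch. It assumes two equiprobable Gaussian categories centered at plus and minus `half_gap`, for which the slope of the log posterior ratio gives $b_0 = 2\,\mathrm{half\_gap} / (\sigma^2 \tau_a)$, a Fisher information rate taken constant over the continuum for simplicity, and the exact function $\Phi_d$; every numerical value and name is hypothetical.

```python
import numpy as np


def phi_d(y):
    """Phi_d(y) = tanh(y) / y, with the limit value 1 at y = 0."""
    y = np.asarray(y, dtype=float)
    safe = np.where(y == 0.0, 1.0, y)
    return np.where(y == 0.0, 1.0, np.tanh(safe) / safe)


def mean_rt_gaussian(x, sigma, half_gap=2.0, gamma=1.0, tau_a=1.0,
                     f_code_rate=10.0, x_f=0.0):
    """Mean decision time beta^2 F Phi_d(beta F (x - x_f)), with beta = gamma / b_0."""
    # Slope of (1 / tau_a) ln[P(2|x) / P(1|x)] for categories centred at +/- half_gap.
    b0 = 2.0 * half_gap / (sigma ** 2 * tau_a)
    beta = gamma / b0
    return beta ** 2 * f_code_rate * phi_d(beta * f_code_rate * (x - x_f))


if __name__ == "__main__":
    x = np.linspace(-4.0, 4.0, 9)
    for sigma in (1.0, 1.5, 2.0):
        print(f"sigma={sigma}:", np.round(mean_rt_gaussian(x, sigma), 3))
```

With these assumptions the decision time peaks at the boundary $x_f$ and grows at every stimulus value when the categories are made wider, which is the qualitative behavior described in the text.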

We now apply the formula (3.46) to data from a numerical simulation, and to
experimental data available in the psycholinguistic literature.

3.2.3 Numerical illustration

We first test our theory with a numerical simulation on the simple case of two
equiprobable Gaussian categories. The coding layer is composed of a (not so large)
number of N = 10 cells (see the Supporting Information for all the numerical
details). Given that we are interested in looking at the interplay between reaction
times and discrimination, we here optimize both the coding layer and the decoding
layer: the parameters of the tuning curves (width and location) in the coding layer
are also optimized.

Following learning, the behavior of the neural population, with respect to
discrimination sensitivity and reaction times, qualitatively reproduces a classic
situation of categorical perception, as summarized in Figure 4. Identification curves
are characterized by an S-shape; mean reaction times are longer at the boundary
between categories than within category (see e.g. Pisoni and Tash, 1974;
Studdert-Kennedy et al., 1963); discrimination accuracy (as quantified by Fisher
information $F^0_{\mathrm{code}}(x)$) is higher at the boundary between categories than within (e.g.
Liberman et al., 1957; Kuhl and Padden, 198a; Repp, 1984; Bornstein and Korda, 1984;
Goldstone, 1994), which captures the so-called categorical perception phenomenon.

Figure 5 (Left) compares the mean reaction times obtained in the numerical
simulation with the ones predicted from formulas (3.46) and (3.35). We can first
emphasize the remarkable correspondence (up to a scaling factor) between the
simulated data and the data predicted by our equation, despite the fact that there are
only 10 cells in the coding layer. Using the parameters of the linear regression extracted
from Fig. 5 (Left), we can then reconstruct the mean reaction time for the whole
continuum. This reconstructed mean reaction time is shown in Figure 5 (Right, red
line), together with the values obtained in the simulation (open circles). Here again,
one can note the remarkable correspondence between the simulated and predicted
values. Note though that the values given by our formula (see the x-axis in Fig. 5
(Left)) are smaller than the true values, hence the need in each case to rescale the
data in order to reconstruct the simulated reaction times. We attribute this bias to
finite size and discretization effects.



3.2.4 Modeling experimental data 

Figure 4: Perceptual consequences of category learning: results of the numerical
simulation. (A) Mean identification function. (B) Mean reaction times. (C) Fisher
information rate of the neuronal population (measure of perceptual sensitivity). These
results qualitatively reproduce a classic situation of category learning, in particular in
the case of phonemic perception (see e.g. Pisoni and Tash, 1974, Fig. 3). Identification
curves are characterized by an S-shape; mean reaction times are longer at the boundary
between categories than within category; discrimination accuracy (as quantified by Fisher
information $F^0_{\mathrm{code}}(x)$) is higher at the boundary between categories than within, i.e. the
neural population exhibits categorical perception.

Figure 5: Reaction times: comparison between simulated data and theoretical
prediction. (Left) Mean reaction times $\bar{\tau}^{\mathrm{emp}}$ obtained by numerical simulation for the 20
stimuli spanning the considered continuum, as a function of the mean reaction times given
by Eq. (3.46). The red line corresponds to the linear regression (correlation coefficient
r=0.9986, p=1.7e-24). (Right) Mean reaction times as a function of the stimulus presented.
The open circles indicate the mean reaction times obtained by numerical simulation,
whereas the red line corresponds to the results derived from Eqs. (3.46) and (3.35).

This section applies our theory to the modeling of mean reaction times obtained in
the psycholinguistic study by Ylinen et al. (2005, Experiment 2). This experimental
study compares the behavioral performances of two groups of subjects with respect
to the perception of a phonological quantity based on duration. In this case, the
two categories considered by the authors of this study are the two vowels /u/ (short
vowel) and /u:/ (long vowel), the contrast being based on vocalic duration. For the
first group of subjects (native speakers of Finnish), this contrast is phonemic, i.e.
these subjects have a distinct representation of the two categories. For the second
group of subjects (Russians), the vocalic quantity is not contrastive. All the subjects
were tested on a continuum of 7 stimuli.

A major interest for us here is that this study measures not only the reaction time
during the identification of categories for each of the 7 stimuli along the continuum,
but also the perceptual distance d' between adjacent stimuli as well as reaction times
during the discrimination phase (see Fig. 6 for a reproduction of these data for the
two groups of subjects). These two latter sets of measurements make it possible to
evaluate the Fisher information rate for the whole continuum.

Figure 6: Reproduction of the experimental data from Ylinen et al. (2005, Experiment
2). On the left, data corresponding to native speakers of Finnish; on the right, those
corresponding to Russian speakers. (A) Identification function. (B) Mean reaction times.
(C) Perceptual distance (d') between adjacent stimuli.

For a given stimulus x, the mean reaction time $\bar{\tau}(x)$ to categorize it is equal to the
sum of the mean time $\tau_{nd}$ resulting from neural propagation and motor realization
(independently of the decision), and the mean time $\bar{\tau}_d(x)$ characterizing the decision
stage:

$$ \bar{\tau}(x) = \tau_{nd} + \bar{\tau}_d(x) \qquad (3.47) $$

The mean time $\bar{\tau}_d(x)$ is given by formula (3.46) (using $\Phi_a$ here), and depends
on three free variables: $F^0_{\mathrm{code}}(x)$, $x_f$, and $\beta$. We first determine the Fisher
information rate $F^0_{\mathrm{code}}(x)$ thanks to experimental measures of d' and corresponding
mean reaction times (measured during the discrimination task). The Fisher
information rate $F^0_{\mathrm{code}}(x)$ is linked to the perceptual distance d' through (see e.g.
Seung and Sompolinsky, 1993):

$$ d' = |\delta x|\, \sqrt{F_{\mathrm{code}}(x)} \qquad (3.48) $$

where, here, $\delta x = 1$. Moreover, as we have seen

$$ F_{\mathrm{code}}(x) = \bar{\tau}_d^{\,\mathrm{discrim}}(x)\, F^0_{\mathrm{code}}(x) \qquad (3.49) $$

where $\bar{\tau}_d^{\,\mathrm{discrim}}(x)$ corresponds to the mean reaction times during the discrimination
task. For a given stimulus x, we compute the quantity $\bar{\tau}_d^{\,\mathrm{discrim}}(x)$ thanks
to the mean reaction times measured by the authors, equal to
$\bar{\tau}^{\,\mathrm{discrim}}(x) = \tau_{nd}^{\,\mathrm{discrim}} + \bar{\tau}_d^{\,\mathrm{discrim}}(x)$, where $\tau_{nd}^{\,\mathrm{discrim}}$, the mean time resulting from
neural propagation and motor realization, is independent of the decision, and is set to
250 ms. Applying a piecewise cubic Hermite interpolation to the experimentally measured
values, we obtain an estimation of d' and of $\bar{\tau}_d^{\,\mathrm{discrim}}(x)$, and thus of $F^0_{\mathrm{code}}(x)$, for
all x in the continuum.
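
The estimation of the Fisher information rate from discrimination data can be sketched as follows. The numbers below are made up for illustration and are not the data of Ylinen et al.; placing each d' value at the midpoint of a stimulus pair is also an assumption. The interpolation uses SciPy's `PchipInterpolator`, one standard implementation of piecewise cubic Hermite interpolation.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Hypothetical measurements, for illustration only (NOT the published data).
pair_centres = np.array([1.5, 2.5, 3.5, 4.5, 5.5, 6.5])   # midpoints of adjacent stimulus pairs
d_prime = np.array([0.4, 0.9, 1.8, 1.6, 0.8, 0.5])         # perceptual distance per pair
rt_discrim_ms = np.array([620.0, 650.0, 700.0, 690.0, 640.0, 610.0])

TAU_ND_DISCRIM_MS = 250.0  # non-decision time, set to 250 ms as in the text
DELTA_X = 1.0              # spacing between adjacent stimuli


def fisher_rate_curve(x):
    """Estimate of the Fisher information rate (per ms) at positions x."""
    d_interp = PchipInterpolator(pair_centres, d_prime)
    rt_interp = PchipInterpolator(pair_centres, rt_discrim_ms)
    f_code = (d_interp(x) / DELTA_X) ** 2       # from d' = |dx| sqrt(F_code)
    tau_d = rt_interp(x) - TAU_ND_DISCRIM_MS    # decision part of the discrimination RT
    return f_code / tau_d                       # from F_code = tau_d * F0_code


if __name__ == "__main__":
    grid = np.linspace(1.5, 6.5, 11)
    print(np.round(fisher_rate_curve(grid), 5))
```

Because this interpolation scheme preserves the shape of the data and does not overshoot, the interpolated d' stays positive between measured points, which keeps the estimated rate positive.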



Only three parameters are thus to be found in order to model the experimental
data: $\tau_{nd}$, $x_f$ and $\beta$. For each group, these parameters are finally obtained by
minimizing the least square error between experimental and predicted values. For the
native speakers of Finnish, we get $\tau_{nd}^{\mathrm{fin}} = 280$, $\beta^{\mathrm{fin}} = 339$ and $x_f^{\mathrm{fin}} = 3.11$ (r=0.996,
p=1.7e-6), and for the Russian group, $\tau_{nd}^{\mathrm{rus}} = 278$, $\beta^{\mathrm{rus}} = 463$ and $x_f^{\mathrm{rus}} = 3.85$
(r=0.959, p=6.5e-4). Figure 7 compares mean reaction times experimentally
obtained with the ones predicted by formula (3.46), optimized for each case. In the
case of native speakers of Finnish (Figure 7 (Left)), alignment between experimental
data and prediction is almost perfect. In the case of the Russian group (Figure 7
(Right)), experimental data and predicted values line up remarkably well too.
Interestingly, the value of $\beta$ is found to be smaller for the native speakers than for
the non-native speakers. This parameter is equal to the ratio between $\gamma$, which
is the decision threshold, and $b_0$, which quantifies the separation between the two
categories. Thus, assuming that one of these two parameters is constant between groups,
$\beta^{\mathrm{fin}} < \beta^{\mathrm{rus}}$ means that either the threshold is lower for the native speakers of Finnish
than for the Russian group, or that the categories are more distinct for the natives.
Both possibilities make sense here, given that we expect the native speakers to have
a more accurate representation of the categories than the non-native speakers (the
vocalic contrast used in this experiment being phonemic for the former group, but
not for the latter).

Figure 7: Reaction times: comparison between experimental data and predictions from
the model. On the left, the data corresponding to the native speakers of Finnish, on the
right, those corresponding to the Russian group. (Top) Mean reaction times experimentally
obtained by Ylinen et al. (2005) for the 7 stimuli spanning the considered continuum,
as a function of the mean reaction times given by our formula, for the two groups of
subjects. The red line corresponds to the y = x line (r=0.996, p=1.7e-6 for the native speakers
of Finnish, and r=0.959, p=6.5e-4 for the Russian group). (Bottom) Mean reaction times
as a function of the stimulus presented, for the two groups of subjects. The open circles
indicate the mean reaction times obtained in the experiment for each stimulus, whereas
the red line corresponds to the model prediction.

It is worth noticing that only three free parameters are used here, thanks to
the complete characterization of the $F^0_{\mathrm{code}}(x)$ quantity using discrimination
measurements. Other models of categorization response times do not allow for such a
characterization, and would thus require more parameters.
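
A three-parameter least-squares fit of this kind can be sketched as follows. The example runs on synthetic data generated from the model itself, not on the experimental measurements; the Fisher information rates, the noise level, the starting point and the use of `scipy.optimize.least_squares` are all assumptions made for illustration, since the optimizer used in the original study is not specified.

```python
import numpy as np
from scipy.optimize import least_squares


def phi_a(y):
    """Phi_a(y) = (sqrt(1 + 4 y^2) - 1) / (2 y^2), in a numerically stable form."""
    return 2.0 / (1.0 + np.sqrt(1.0 + 4.0 * np.asarray(y, dtype=float) ** 2))


def predicted_rt(params, x, f_rate):
    """Mean reaction time tau_nd + beta^2 F Phi_a(beta F (x - x_f))."""
    tau_nd, x_f, beta = params
    return tau_nd + beta ** 2 * f_rate * phi_a(beta * f_rate * (x - x_f))


# Synthetic example (hypothetical values, not the experimental data).
x = np.arange(1.0, 8.0)                                          # 7 stimuli
f_rate = np.array([0.8, 1.0, 1.6, 2.0, 1.5, 1.0, 0.8]) * 1e-3    # assumed rates, per ms
true_params = np.array([280.0, 3.5, 400.0])                      # tau_nd, x_f, beta
rng = np.random.default_rng(1)
rt_obs = predicted_rt(true_params, x, f_rate) + rng.normal(0.0, 2.0, size=x.size)


def residuals(params):
    return predicted_rt(params, x, f_rate) - rt_obs


start = np.array([250.0, 4.0, 300.0])
fit = least_squares(residuals, x0=start)
rms_init = np.sqrt(np.mean(residuals(start) ** 2))
rms_fit = np.sqrt(np.mean(fit.fun ** 2))
print("fitted (tau_nd, x_f, beta):", np.round(fit.x, 2))
print(f"RMS residual: {rms_init:.1f} ms before fit, {rms_fit:.1f} ms after fit")
```

Since the model is even in $\beta$, only its absolute value is identifiable; starting the search from a positive value is a simple way to keep the fitted parameter in the meaningful range.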



4 Discussion 



4.1 Interplay between identification and discrimination 

The theory presented in this paper highlights the differences and relationships
between identification and discrimination. The identification of categories is based on
the output of the decoder, defined by the associative weights $\mathbf{w}$, whereas
discrimination performance is determined at the level of the coding layer. In
Bonnasse-Gahot and Nadal (2008), we showed that following category learning, neural
optimization results in more neural resources allocated at the boundary between
categories, with the aim of maximizing mutual information between neural activity and
categories. Here, we have seen that optimization of the properties of the neuronal
population entails a reduction of the uncertainty in the estimate of the posterior
probabilities $P(\mu|x)$, which is particularly relevant in the transition regions between
categories, and makes it possible to minimize classification errors.

This distinction is illustrated by the differences in the perception of a native speaker
and of a second-language learner. A second-language learner has to associate sounds
with new categories, i.e. she has to build a decoder. After a learning phase, assuming
no interference with existing category representation, this individual might then be
able to correctly assign a label to the sounds she hears, thus presenting a response
similar to the one produced by a native speaker. Nevertheless, this second-language
learner will not necessarily exhibit a better discrimination at the boundary between
categories. In contrast, due to a more intensive experience and because the neural
investment is behaviorally more relevant, a native speaker will typically exhibit a
discrimination peak at the boundary between categories, which is a perceptual
consequence of an optimized neural code (or 'neural commitment', as Kuhl (2004) puts
it).

This situation finds some experimental support in the study by Heeren and Schouten
(2008) (see also Hallé et al., 2004; Xu et al., 2006). Following an identification of
categories, Dutch learners of Finnish exhibit a response curve similar to native
speakers of Finnish, whereas naive Dutch speakers do not. Their discrimination curves
however do not present a peak at the boundary, contrary to the native speakers.
Following our analysis, more training and more language experience should lead a
second-language learner to optimize her perceptual map, so as to better perceive
fine variations at the class boundary. This is indeed the case: contrary to first- and
second-year students, only third-year students present a discrimination peak at the
category boundary.

The distinction between identification and discrimination is also reflected by reaction
times. It is well known that reaction times follow some positive function of
uncertainty: they are longer at the class boundary than within a category. As Pisoni and
Tash (1974) note, the shape of the reaction times qualitatively follows the shape of the
discrimination, typically greater between categories. We have seen though that this
is not necessarily the case. Our results indeed show that longer reaction times at
the boundary are inherent to the identification process, independently of a
discrimination peak. We have nevertheless exhibited a quantitative link between reaction
times and discrimination accuracy (see Eq. (3.34)), showing that better discrimination
implies longer reaction times (everything else being equal). We can thus predict
that better discrimination at the boundary between categories results in a shape
of reaction times that is sharper and with larger amplitude, which is supported by
several experimental results (Hallé et al., 2004; Ylinen et al., 2005).



4.2 Neurophysiological data 

In the studied model, a neural map encodes categorical information in a distributed 
fashion, so that if one only looks at a particular individual neuron, little information 
is conveyed, and the shape of the tuning curve does not reflect a categorical code. 
The influence of categorization on the neuronal properties has to be evaluated at 
the population scale. Conversely, the output cells, involved in the decision process, 
code in a more direct way for the categories. Their activities indeed follow the pos- 
terior probabilities related to the categories: a given cell responds similarly within 
a category and sharply differently between. This situation finds biological support 
in recent neurophysiological studies. 

In particular, in the case of the visual system, the inferotemporal cortex (IT)
encodes information in a distributed way, with a population coding strategy, and
feeds downstream prefrontal (PFC) regions characterized by more categorical
responses. Several studies have shown that category learning modifies the neuronal
properties of the IT population (Sigala and Logothetis, 2002; Kriegeskorte et al.,
2008; De Baene et al., 2008; Op de Beeck et al., 2008). In their study on the
influence of categorization in the inferotemporal and prefrontal cortices, Freedman
et al. (2003) conclude that there is (almost) no categorical information in the
inferotemporal cortex, whereas cells in the prefrontal cortex do show categorical
specificity. These arguments are based on a measure of categorical selectivity at the
single neuron scale, which might overlook information collectively conveyed by the
whole population. Several studies have nevertheless shown an influence of category
learning on neuronal properties in the inferotemporal cortex, and insist on the
distributed coding strategy employed in the inferotemporal cortex and the more
individual code in the prefrontal cortex (Op de Beeck et al., 2008; Meyers et al.,
2008; Kriegeskorte et al., 2008).

Similarly, the MT (middle temporal) region, thought to play a major role in the
perception of motion and in the guidance of eye movements, is well modeled as a large
population of direction-specific cells.

Located downstream of the inferotemporal cortex, the prefrontal cortex is known
as a site for higher cognitive functions, notably decision-making. Several studies
show that neurons in the prefrontal cortex have an activity that more directly reflects
category membership, and which is not much affected by the physical properties of
the stimulus itself (Freedman et al., 2001, 2002). These neurons typically have a
step-like tuning curve (or its continuous counterpart, an S shape), and exhibit a
strong categorical selectivity at the individual level. We can also mention here the
existence of neurons that respond categorically following category learning, in both
auditory cortex (Ohl et al., 2001; Prather et al., 2009) and primary motor cortex
(Salinas and Romo, 1998).



Concerning the decision mechanism and reaction times, several studies published in
the past decade have brought quantitative support to the kind of diffusion model
we used here, for which neuronal activity represents accumulation of evidence in
the decision (Kim and Shadlen, 1999; Shadlen and Newsome, 2001; Heekeren et al.,
2004; Huk and Shadlen, 2005; Smith and Ratcliff, 2004; Gold and Shadlen, 2007).



In the case of random dot discrimination task experiments where the decision is
made through eye movements, the LIP (lateral intraparietal) region, which receives
inputs from MT, has been identified as the locus of such decision mechanisms
(Shadlen and Newsome, 2001).

In our modeling, we have assumed uncorrelated cells (conditional to the stimulus).
For the coding stage, the main results hold or are easily generalized (Bonnasse-Gahot
and Nadal, 2008; Bonnasse-Gahot, 2009) whenever the noise correlations preserve the
scaling of the Fisher information with the size of the neural assembly
($F_{\mathrm{code}}(x) \sim N$), which is known to be the case for a large family of
correlations (Abbott and Dayan, 1999; Yoon and Sompolinsky, 1999). However, the
hypothesis of uncorrelated cells plays an important role for the decoding layer, for
which the results explicitly need that the output cells sum independent random
variables. Experimental results in favor of diffusion models are actually easily
understood if this is the case. However, some experimental works strongly suggest that
important correlations exist in the coding layer (Zohary et al., 1994). To reconcile
such results with the observed activities in the decoding areas, some authors proposed
that the cells might sum a small number of well chosen cells in the coding layer
(Zohary et al., 1994; Britten et al., 1992). We have seen that, in the numerical
simulations, the results obtained with a rather small number of independent cells are
already in good agreement with the analytical results assuming a large number of
cells. An alternative but related scenario is to assume that the correlations decrease
sufficiently fast with the difference in preferred stimuli, so that the effective
number of independent cells seen by the decoding layer is of order $N/R$, where $R$ is
the typical scale of the correlations. Provided $N \gg R$, one may expect the results
presented here to apply as well. In addition, the existence of strong correlations has
recently been challenged (Renart et al., 2010; Ecker et al., 2010), from analyses
based on both theoretical and experimental approaches. Such a controversial issue
needs to be resolved by new experiments, and specific studies of optimal decoding with
correlations remain to be done.
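
The linear scaling of the Fisher information with the population size can be checked numerically in the simplest uncorrelated case. The sketch below is only an illustration under stated assumptions: cells are taken to be independent Poisson neurons, for which the Fisher information is the standard sum of $f_i'(x)^2 / f_i(x)$ over cells (times the time window); the tuning curves are assumed Gaussian, with parameter values borrowed from Appendix B; and the function names are hypothetical.

```python
import math


def tuning(x, center, width=2.0, f_min=0.001, f_max=5.0):
    """Return rate and slope at x of an assumed Gaussian tuning curve."""
    bump = (f_max - f_min) * math.exp(-(x - center) ** 2 / (2 * width ** 2))
    rate = f_min + bump
    slope = -(x - center) / width ** 2 * bump
    return rate, slope


def fisher_information(x, n_cells, lo=-6.0, hi=6.0, window=1.0):
    """Fisher information at x for n_cells independent Poisson cells.

    Preferred stimuli are spread evenly over [lo, hi]; independence and
    Poisson statistics are assumptions of this sketch.
    """
    total = 0.0
    for i in range(n_cells):
        center = lo + (hi - lo) * i / (n_cells - 1)
        rate, slope = tuning(x, center)
        total += window * slope ** 2 / rate
    return total


if __name__ == "__main__":
    for n in (10, 20, 40, 80):
        print(n, round(fisher_information(0.0, n), 3))
```

Doubling the number of cells roughly doubles the value returned at a stimulus located well inside the covered domain, which is the scaling referred to above. Correlated noise would require the full covariance matrix and is not covered by this sketch.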

In any case, it thus appears that for different modalities and categorization tasks,
the same global scheme is found: a distributed encoding with a large population of
feature-specific cells, a read-out layer and a decision mechanism (a diffusion or
accumulator mechanism). In Bonnasse-Gahot and Nadal (2008), we discuss the
relevance of our approach to the modeling of, e.g., the IT neural assemblies as
corresponding to the coding neural cells. Here our main results concern the decoding
layer (e.g. PFC, LIP), and link this decoding layer to the coding stage. In
particular, from our theory, one should find that both the bias and variance of the
random walk process depend on the stimulus. More precisely, these parameters
should be related to the class probabilities (given the stimulus) and to the sensitivity
of the neural code with respect to the stimulus.



4.3 Concluding remarks

When dealing with a difficult categorization task, the brain has to face two indepen- 
dent sources of uncertainty: categorization uncertainty and neuronal uncertainty. 
The latter stems from neuronal noise, whereas the former is intrinsic to the category 
structure in stimulus space: categories like phonemes or colors typically overlap, 
so that a given stimulus might belong to different categories. Here, we propose a 
general neural theory of category coding, in which these two sources of uncertainty 
are quantified by means of information theoretic tools. We analytically show how 
these two quantities combine at both coding and decoding stages of the information 
process. Considering optimal representations, we derive formulae which capture 
different psychophysical consequences of category learning - namely, a better dis- 
crimination between categories, and longer reaction times to identify the category 
of a stimulus lying at the category boundary. Finally, we analytically relate micro- 
scopic quantities (neural properties) to macroscopic quantities that are behaviorally 
measurable (discrimination accuracy): this allows us to model experimental data, in 
the present work taken from the psycholinguistic literature. A major contribution of
this work is thus to exhibit, in both quantitative and qualitative terms, the interplay
between discrimination and identification, thanks to a global approach which links
the 'top-down' approach (the ideal observer approach, where one compares the
behavioral performance to the optimal one) with the 'bottom-up' approach (the
building of a neural code starting from the stimulus space).

The stimulus structure is here formalized within a probabilistic framework, and we 
considered a neural architecture aiming at extracting the categorical information. 
The stimulus is encoded by a large population of stimulus-specific neurons, and the 
decoding is achieved by a layer of category-specific cells. We have shown that the 
output of these cells can estimate the posterior probabilities giving the likelihood of 
the classes knowing the stimulus (in the simulations we considered a particular super- 
vised learning scheme, but one can expect that others, supervised or unsupervised, 
can achieve the same results). Minimizing the Kullback-Leibler divergence between the
true probabilities and the output of the network not only yields an unbiased
and efficient estimate of these probabilities but also, if the properties of the
neuronal population are optimized, maximizes the mutual information between the
activity of this neuronal population and the categories. Within such a context,
allocating more neural resources at the boundary between classes (Bonnasse-Gahot and
Nadal, 2008) makes it possible to minimize classification errors at the boundaries,
thanks to a better estimation of posterior probabilities. We have restricted the
analysis to the case of a one-dimensional input: the present theory can easily be
generalized to multidimensional inputs, as done in Bonnasse-Gahot and Nadal (2008).
The issue of correlations in the coding layer and its impact on decoding deserves more
studies, as discussed in Section 4.2 above.

As explained previously, the presented neural model shares the same skeleton with
others, but is simple enough to allow for analytical results. The latter quantify the
efficiency in the categorization task, and give the best possible performance that can
be achieved through learning (the specific issue of learning is not addressed here).
Despite this (relative) mathematical simplicity, the model preserves a strong biolog- 
ical plausibility. The coding/decoding architecture receives support from several 
recent experimental results in neurophysiology. A neuronal population encodes cat- 
egorical information in a distributed fashion and then feeds downstream regions that 
use this information to realize higher cognitive functions, such as decision-making. 
In the particular case of the visual system, this situation corresponds respectively 
to the IT and PFC (as found in visual object categorization tasks), as well as to 
the MT and LIP regions (as found in random dots experiments). At the level of 
the inferotemporal cortex, categorical information is distributed among the whole 
neuronal population, so that each neuron taken individually is not category specific. 
Conversely, in the prefrontal cortex, category membership is more explicitly repre- 
sented at the level of a single cell, where information gets accumulated over time. 
This decisional process is here modeled by a diffusion model: a random variable, 
the difference in output activities, evolves over time until it reaches a certain bound, 
positive or negative, leading to the corresponding response. We have analytically 
characterized the mean reaction time necessary to the establishment of the deci- 
sion, relating it to both a quantity that measures the degree of membership of the 
stimulus to one or the other category, and to the Fisher information quantifying the 
perceptual sensitivity in a discrimination task. We have shown that the formula we
derived accounts for experimental data obtained in the psycholinguistics literature
(Ylinen et al., 2005). This comparison is however based on data that involve a small
number of stimuli and, more importantly, are only averages of the performances of a
group of subjects. In order to test our model more precisely, future research should
gather detailed individual behavioral data on discrimination accuracy and reaction
times. Experiments on animals could provide the same type of data together with
measurements of neural responses, at both the encoding and decoding levels, so as
to test the interplay between these two stages as we have discussed in this paper.
Experiments should focus on the smooth transition between categories which, in
view of our analysis, is the most relevant region to reveal both the sensitivity of
the neural code and the related shape of the reaction times.
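
For orientation, the textbook first-passage result for a one-dimensional diffusion with constant coefficients can be written down explicitly. This is a generic sketch, not the expression derived in this paper: the symbols $\nu$ (bias), $\sigma^2$ (variance per unit time) and $\theta$ (bound) are notation assumed here, and the process is assumed to start midway between two symmetric absorbing bounds at $\pm\theta$.

```latex
\begin{align}
\langle T \rangle &= \frac{\theta}{\nu}\,
    \tanh\!\left(\frac{\theta\,\nu}{\sigma^{2}}\right), &
P(\text{upper bound reached first}) &=
    \frac{1}{1 + e^{-2\theta\nu/\sigma^{2}}}, \\
\lim_{\nu \to 0} \langle T \rangle &= \frac{\theta^{2}}{\sigma^{2}}.
\end{align}
```

For fixed $\theta$ and $\sigma^2$, $\langle T \rangle$ decreases as $|\nu|$ grows, so a weaker bias yields slower decisions, in line with the longer reaction times discussed here for stimuli near the category boundary.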

Finally, we mention that within our framework the modeling of random dot
experiments requires considering the extension to a time-fluctuating, multi-dimensional
stimulus, in the vein of Ashby (2000) or Beck et al. (2008). More importantly, it
requires a specific analysis: as the level of coherence changes, almost every quantity
(both Fisher information values, the bias and the variance of the diffusion process) is
changed. We leave to further work the study of the resulting dependency of reaction
times on the coherence.



A Network optimization: supervised learning scheme



For the numerical simulations, we made use of a supervised learning scheme which
we present here, proving that, in the asymptotic limit of a very large training set,
the chosen cost function gives the cost $C$ considered in the theoretical analysis.

During learning, stimuli are presented sequentially, along with their category
label. For a given stimulus $x$, the output $g(\mu|\mathbf{r},\mathbf{w})$ is compared
with the desired binary output given by the indicator function $t_\mu(x)$ (for
teacher), defined as follows:

$$ t_\mu(x) = \begin{cases} 1 & \text{if } x \in \mu, \\ 0 & \text{otherwise,} \end{cases} \tag{A.1} $$

where $x \in \mu$ means that stimulus $x$ belongs to the category labeled $\mu$. The
distance between the output $g(\mu|\mathbf{r},\mathbf{w})$ and the teacher value
$t_\mu(x)$ is measured by the following training cost function:

$$ \sum_\mu t_\mu(x) \ln \frac{t_\mu(x)}{g(\mu|\mathbf{r},\mathbf{w})} \tag{A.2} $$

Its average over all the realizations of the neural activity $\mathbf{r}$ is given by:

$$ C_t(x) = \int d^N\mathbf{r}\; P(\mathbf{r}|x) \sum_\mu t_\mu(x) \ln \frac{t_\mu(x)}{g(\mu|\mathbf{r},\mathbf{w})} \tag{A.3} $$

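Let us now show that a large number of stimulus presentations during the learning
phase leads to an estimate of the posterior probabilities (in a way similar to the one
presented in Duda et al., 2001).

After $n$ stimulus presentations, the mean cost function becomes:

\begin{align}
\frac{1}{n}\sum_x C_t(x)
  &= \frac{1}{n}\int d^N\mathbf{r} \sum_x P(\mathbf{r}|x) \sum_\mu t_\mu(x) \ln \frac{t_\mu(x)}{g(\mu|\mathbf{r},\mathbf{w})} \tag{A.4} \\
  &= -\frac{1}{n}\int d^N\mathbf{r} \sum_\mu \sum_{x \in \mu} P(\mathbf{r}|x) \ln g(\mu|\mathbf{r},\mathbf{w}) \tag{A.5} \\
  &= -\sum_\mu \int d^N\mathbf{r}\; \frac{n_\mu}{n}\, \frac{1}{n_\mu} \sum_{x \in \mu} P(\mathbf{r}|x) \ln g(\mu|\mathbf{r},\mathbf{w}) \tag{A.6}
\end{align}

where $n_\mu$ is the number of stimuli labeled $\mu$ among the $n$ stimuli that were
presented to the network.

For a very large number of stimuli, the mean cost $\overline{C}_t$ then writes:

$$ \overline{C}_t = \lim_{n \to \infty} \frac{1}{n}\sum_x C_t(x) = -\sum_\mu \int d^N\mathbf{r}\; q_\mu \int dx\; P(x|\mu)\, P(\mathbf{r}|x) \ln g(\mu|\mathbf{r},\mathbf{w}) \tag{A.7} $$

hence, given that $\int dx\; P(x|\mu)\, P(\mathbf{r}|x) = P(\mathbf{r}|\mu)$, and that,
according to Bayes' rule, $q_\mu P(\mathbf{r}|\mu) = P(\mathbf{r})\, P(\mu|\mathbf{r})$,
we get

$$ \overline{C}_t = -\int d^N\mathbf{r}\; P(\mathbf{r}) \sum_\mu P(\mu|\mathbf{r}) \ln g(\mu|\mathbf{r},\mathbf{w}) \tag{A.8} $$

This is the same as $C$ except for a constant additive term (the entropy $H(\mu|x)$),
implying that minimization of the cost leads to an estimate of the posterior
probabilities, as desired.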

In the numerical illustrations, learning is done through a gradient descent
algorithm (Rumelhart et al., 1986) aiming at minimizing the cost function (A.2), with
the presentation to the network of 30000 stimuli along with their category label.
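
The learning scheme of this appendix can be illustrated with a small stochastic gradient descent. The sketch below is not the simulation code used for the paper. It assumes a logistic readout for the two-category case, Poisson spike counts in a unit time window, and Gaussian tuning curves with the parameter values quoted in Appendix B. Only the readout weights are trained (the tuning curves are kept fixed), and the names `train`, `mean_output` and `posterior` are hypothetical.

```python
import math
import random

# Assumed setup, borrowing the values quoted in Appendix B.
CENTERS = [-6.0 + 12.0 * i / 9 for i in range(10)]  # preferred stimuli, N = 10
WIDTH, F_MIN, F_MAX = 2.0, 0.001, 5.0               # tuning-curve parameters
CAT_CENTERS, CAT_STD = (-3.0, 3.0), 1.5             # two equiprobable categories


def rates(x):
    """Mean spike counts in a unit time window (assumed Gaussian tuning)."""
    return [F_MIN + (F_MAX - F_MIN) * math.exp(-(x - c) ** 2 / (2 * WIDTH ** 2))
            for c in CENTERS]


def poisson(lam, rng):
    """Draw a Poisson variate with mean lam (Knuth's multiplication method)."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def output(r, w, b):
    """Logistic readout g(category 1 | r, w): an assumed form of the output."""
    z = b + sum(wi * ri for wi, ri in zip(w, r))
    z = max(-30.0, min(30.0, z))  # clip the logit to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train(n_stimuli=30000, lr=0.01, seed=0):
    """Stochastic gradient descent on the cost -sum_mu t_mu ln g(mu | r, w)."""
    rng = random.Random(seed)
    w, b = [0.0] * len(CENTERS), 0.0
    for _ in range(n_stimuli):
        t = 1.0 if rng.random() < 0.5 else 0.0       # teacher: 1 for category 1
        x = rng.gauss(CAT_CENTERS[0] if t else CAT_CENTERS[1], CAT_STD)
        r = [poisson(f, rng) for f in rates(x)]
        err = output(r, w, b) - t                    # derivative of cost w.r.t. logit
        w = [wi - lr * err * ri for wi, ri in zip(w, r)]
        b -= lr * err
    return w, b


def mean_output(x, w, b, n_trials=2000, seed=1):
    """Average network output over Poisson realizations of r at stimulus x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        total += output([poisson(f, rng) for f in rates(x)], w, b)
    return total / n_trials


def posterior(x):
    """Exact P(category 1 | x) for the two equiprobable Gaussian categories."""
    gap = CAT_CENTERS[1] - CAT_CENTERS[0]
    mid = 0.5 * (CAT_CENTERS[0] + CAT_CENTERS[1])
    return 1.0 / (1.0 + math.exp(gap * (x - mid) / CAT_STD ** 2))


if __name__ == "__main__":
    w, b = train()
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(x, round(mean_output(x, w, b), 3), round(posterior(x), 3))
```

With so few cells and spikes, the trained output estimates the posterior given the noisy activity rather than the posterior given the stimulus itself, so the two printed columns are expected to agree only approximately, with the network output being the smoother of the two.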



B Reaction times: numerical details 

This section gives the numerical details corresponding to the simulation presented in
Section 3.2. This numerical example involves two equiprobable Gaussian categories,
centered at $x_{\mu_1} = -3$ and $x_{\mu_2} = 3$, with standard deviation
$a_{\mu_1} = a_{\mu_2} = 1.5$. The neuronal population (coding layer) is made of
$N = 10$ cells, with bell-shaped tuning curves,

$$ f_i(x) = f_{\min} + (f_{\max} - f_{\min}) \exp\left(-\frac{(x - x_i)^2}{2 a_i^2}\right). \tag{B.1} $$

The preferred stimuli $x_i$ of the neurons are initially equidistributed along the
domain $[-6, 6]$. Before learning, each tuning curve has the same width ($a_i = 2$).
Minimal and maximal values of the firing rates are respectively set to
$f_{\min} = 0.001$ and $f_{\max} = 5$.

During the learning phase, 100000 stimuli are presented to the network, and both the
weights $\mathbf{w}$ and the parameters of the tuning curves (width and location) are
optimized. The time window $\tau_a$ used during learning is equal to 1.

After learning, we look at the response of the network following the presentation
of a stimulus, according to the diffusion model presented in Section 2.2.3. The
simulation of this diffusion process is done as follows. We first generate a Poisson
process by dividing the time interval $[0, 3\tau_a]$ into 3000 bins. For a neuron $i$,
each interval, of width $d\tau = \tau_a/1000$, receives a spike according to a
Bernoulli law of parameter $f_i(x)\,d\tau$ ($d\tau$ being small, we thus get a Poisson
process associated with each neuron). We then compute the temporal evolution of the
output, as well as the time at which this quantity reaches one of the two bounds for
the first time. In this numerical example, the bound is set equal to 0.3. For each
stimulus $x$, this process is run 10000 times, which makes it possible to have an
estimate of the mean reaction time $T_d(x)$. In the end, this operation is done for 20
stimuli equidistributed along a continuum ranging from $-4$ to $4$.
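
The binned simulation described in this appendix can be sketched as follows. This is an illustration under explicit assumptions, not the code used for the paper: the decision variable is taken to be a weighted sum of the spikes emitted so far, the readout weights in `WEIGHTS` are hypothetical hand-set values rather than learned ones, and the tuning curves are assumed Gaussian and left at their initial, untrained settings. Only the binning scheme, the Bernoulli spike rule and the bound value of 0.3 are taken from the text.

```python
import math
import random

CENTERS = [-6.0 + 12.0 * i / 9 for i in range(10)]  # preferred stimuli, N = 10
WIDTH, F_MIN, F_MAX = 2.0, 0.001, 5.0               # untrained tuning parameters
TAU_A, N_BINS = 1.0, 3000                           # [0, 3 * TAU_A] in 3000 bins
D_TAU = 3.0 * TAU_A / N_BINS                        # bin width, TAU_A / 1000
BOUND = 0.3                                         # decision bound from the text
WEIGHTS = [-0.05 * c for c in CENTERS]              # hypothetical, not learned


def rates(x):
    """Firing rates of the coding cells (assumed Gaussian tuning curves)."""
    return [F_MIN + (F_MAX - F_MIN) * math.exp(-(x - c) ** 2 / (2 * WIDTH ** 2))
            for c in CENTERS]


def first_passage_time(x, rng):
    """Simulate one trial; return (time, choice), or None if no bound is hit.

    In each bin, every cell emits a spike with probability f_i(x) * D_TAU,
    and each spike adds the cell's weight to the accumulated evidence.
    """
    spike_prob = [f * D_TAU for f in rates(x)]
    evidence = 0.0
    for k in range(1, N_BINS + 1):
        for w, p in zip(WEIGHTS, spike_prob):
            if rng.random() < p:
                evidence += w
        if abs(evidence) >= BOUND:
            return k * D_TAU, (1 if evidence > 0 else 2)
    return None


def mean_reaction_time(x, n_trials=200, seed=0):
    """Mean first-passage time over the trials that reached a bound."""
    rng = random.Random(seed)
    times = []
    for _ in range(n_trials):
        result = first_passage_time(x, rng)
        if result is not None:
            times.append(result[0])
    return sum(times) / len(times) if times else float("nan")


if __name__ == "__main__":
    for j in range(9):
        x = -4.0 + j
        print(x, round(mean_reaction_time(x, n_trials=100), 3))
```

The demonstration loop uses far fewer trials and stimuli than the 10000 runs and 20 stimuli quoted above, only to keep the pure Python run time short. With these hand-set weights, the mean time is longest near $x = 0$, where the evidence has no net drift.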